Concept
[Article by Ulrich Trottenberg of Suprenum GmbH,
Bonn: “Suprenum—The Concept”]


[Text] The Suprenum 5-Gflops supercomputer is the
result of a national German project which comprises
MIMD [multiple instruction, multiple data] multivector
processor hardware with distributed memory and
the development of all software layers that guarantee
comfortable and efficient exploitation of the scalable
hardware. Also, a large amount of application software
has been developed on the basis of superfast parallel
algorithms. The applications cover all typical models in
scientific computing.


Suprenum is an unusual combination of a widespread,
long-range research-oriented activity and a strictly
product-oriented development. The research idea was to
bring together:

— the users of supercomputers representing the know-how
for the grand challenge problems (“superproblems”)
in scientific computing and numerical simulation;

— the computer architects representing the know-how
on parallel architectures, parallel languages, and tools
for parallel computing;

— the numerical analysts representing the know-how on
fast numerical algorithms (like multigrid and multilevel
approaches) and their “superfast” parallel versions.


Suprenum combines these three fields. In its product- 
oriented part, it consequently develops a system that 
integrates: 

— hardware;

— the operating and the run-time system; 

— programming environment; 

— parallelization (partitioning and vectorization) tools; 
— basic and advanced application software. 


About a third of the manpower in the Suprenum project 
is devoted to the hardware, another third to system 
software, programming environment and tools; and the 
last third is used in application software. 


The Suprenum Hardware Essentials and Software 
Developments 


The Suprenum prototype system, which will be operational
by the end of 1989, is, with respect to its hardware,
characterized by the following essentials (see Figure 1):

— highly parallel MIMD architecture with a peak
performance of 5 Gflop/s;
— 256 computing nodes aggregated into 16 clusters;
— each node with 8 Mbyte private memory (giving an
overall main memory of 256 x 8 Mbyte = 2 Gbyte);
— each node with a vector floating-point unit (20
Mflop/s, if chaining is used);
— flexible two-level (intra- and inter-cluster)
interconnection network on the basis of very fast busses.

Figure 1. Structure of the Suprenum 1 prototype with
256 nodes in 16 clusters.
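
The totals quoted in the hardware list follow from the per-node figures by
simple multiplication. The short Python sketch below only re-derives them;
the constant names are illustrative and not taken from any Suprenum
documentation.

```python
# Per-node figures as quoted in the text for the Suprenum 1 prototype.
NODES = 256
CLUSTERS = 16
NODE_MEMORY_MBYTE = 8
NODE_PEAK_MFLOPS = 20  # peak rate with chaining

nodes_per_cluster = NODES // CLUSTERS                 # computing nodes per cluster
total_memory_gbyte = NODES * NODE_MEMORY_MBYTE / 1024 # overall main memory
peak_gflops = NODES * NODE_PEAK_MFLOPS / 1000         # aggregate peak performance

print(nodes_per_cluster, total_memory_gbyte, peak_gflops)
```

The aggregate peak comes out slightly above the rounded 5 Gflop/s given in
the text.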


The timetable for the hardware development is as
follows: After the first ideas in 1984, most of the
essential architecture decisions were made in 1985. In
1986, Suprenum GmbH was founded, and work essentially
commenced. A preliminary system (10 nodes in
two clusters) with nearly the full functionality was
available in 1987. The final node with full performance
was running in 1988.


This year, 1989, a 32-node system (two full clusters)
will be in operation at the Hannover fair in April, and
the final 256-node prototype (16 full clusters) will be
operational by the end of this year.


The development of system software, programming
environment, compilers, tools, and application software
was carried out essentially in parallel with the
hardware development. Actually most of the software
will already be finalized before the hardware is fully
operational. In order to be able to achieve this, it was
extremely important to make simulators available that
allow, for example, Suprenum application software (in
Suprenum-Fortran) to be developed and tested on
other computers. Furthermore, an essential part of the
project was devoted to analysing the system behavior
and its influence on the overall system performance for
each single hardware and software component. On the
basis of these tools, very precise performance predictions
were made which, in turn, allowed tuning and
optimizing the system essentials and the software.


Suprenum in the Supercomputer World


In order to embed Suprenum into the supercomputer
world, we distinguish 8 classes of architectures (see
Figure 2).


1. SIMD [single instruction, multiple data] versus MIMD
(vertical line). SIMD operation mode means that parallel
or pipelined functional units execute the same instruction
sequence on different data.


The MIMD principle is the favorite operation mode for
multiprocessors based on independent complete processors.
Each processor may execute a different instruction
stream within the same application.


2. Shared versus distributed memory (horizontal line).
One of the central problems to be solved in the design of
multiprocessor systems is the memory access. Basically,
there are two possibilities to organize this:

— shared memory (sm) guarantees fair access to a global
memory for each processor;

— distributed memory (dm) means that each processor
has direct access only to its own private memory.


Often both memory organization types are combined in
hierarchical memory systems.


Our classification reflects the user's view of the memory 
organization rather than its hardware realization. 


3. Scalar versus vector floating-point units (dashed
horizontal lines). Presently, scalar floating-point units seem
to be restricted to a floating-point performance of less
than 10 Mflop/s. The most cost-effective way to achieve
higher floating-point rates is vector processing. Therefore,
the most powerful architectures today are mixed
MIMD/SIMD multiprocessor systems (classes 4 and 8).
The efficient use of these architectures requires parallelism
on two levels: the coarse grain parallelism related
to the global MIMD structure and the fine grain parallelism
which ensures efficient vector processing locally.


In the following we briefly describe the 8 classes of
supercomputer architectures and name typical
representatives of them.


Class 1: Scalar Computers 


The “traditional” Von Neumann computer architecture
(SISD [single instruction, single data]) is the basis for
mainframes, minicomputers, and microcomputers.
Using the current hardware technology, the floating-point
performance of this architecture seems to be limited
to 10 Mflop/s, which is much less than current
supercomputer performance.


Class 2: Vector Computers 


Historically the first machines to be called supercomputers
were vector computers. Their hardware architecture
is based on very fast arithmetic pipelines which
support the rapid execution of vector instructions
operating on all components of vector operands
simultaneously. Vectors in that sense consist of components
which can be processed independently. Hence, vector
processing is a special form of parallel processing based
on fine grain parallelism. Application codes have to be
vectorized (i.e., operations are defined on vectors and
certain data dependencies between operations are
excluded) in order to exploit the potential speed of the
hardware. The need for vectorization resulted in new
vector algorithms and in special compiler tools (vectorizers)
for the automatic vectorization of existing codes.


Examples of vector machines are Cray-1, Cyber 205,
Fujitsu VP, NEC SX, Hitachi S-810, and the IBM 3090-VF.


Due to technological progress in VLSI chip development,
vector computer architectures today can be realized
in standard microcomputer technology. These systems
are smaller, somewhat slower and considerably
cheaper than the classical vector computers and are
therefore called minisupercomputers. The vector
minisupercomputers take advantage of the existing software
and tools for vector machines; some systems are
even Cray-compatible. Examples are Convex C1 and
SCS-40.


Class 3: Scalar SM-Multiprocessors 


Another way to increase the computing performance is
to combine several single processors into a multiprocessor
system and to replace sequential processing
by parallel processing. The optimal degree of
parallelism (fine or coarse granularity) depends on the
number and the power of the single processors as well as
on the memory organization. The shared memory concept
restricts the number of CPUs to less than or equal to
8 today (e.g. the Alliant). If the memory is accessed via a
network, a larger number of CPUs can be connected at
the cost of longer access times. Examples are the IBM
RP-3 and the Cedar project (= clusters of Alliant systems).
Further examples in this class are the Sequent,
Flexible, Encore, and Concurrent Computers machines.


Class 4: Vector SM-Multiprocessors


The step from a single processor to a multiprocessor
system (class 1 to class 3) is, of course, also possible and
obvious for vector computers (class 2). As for
scalar multiprocessors, the performance is increased by
composing several vector CPUs into multiprocessor systems
with the same memory access problems. The shared
memory concept limits the number of vector processors
(today less than or equal to 8). The parallelism on these
systems is often used to increase the throughput of the
systems (running different jobs on different CPUs).
MIMD parallel as well as SIMD-like processing is also
possible (e.g. on the Cray X-MP using macrotasking or
microtasking constructs). Representatives of this class
are the Cray X/Y-MP, Cray-2, and the ETA-10.


Class 5: Scalar Array Processors 


The era of parallel computers started with array processors
which perform one instruction simultaneously on an
array of operands (in SIMD mode). Recently these
systems have been upgraded to massively parallel
multiprocessors (with many thousands of processors). Each
processor is relatively small and weak but the enormous
degree of parallelism may result in supercomputer
performance. Typically, these systems are used for a
restricted class of special applications. We mention here
the historical Illiac IV, the Goodyear MPP, the ICL
DAP, and the Connection Machine 1.


Class 6: Vector Array Processors 


The combination of (SIMD) array and vector processing
has been realized in the Connection Machine 2, which
presently is the system with the highest floating-point 
performance rate for special applications on very regular 
data structures. 


Class 7: Scalar DM-Multiprocessors 


Today, multiprocessor systems with a large (and in
principle unlimited) number of processors require that the
memory units are physically associated with the processors
(distributed memory). The basic unit of such a
system (a “processing node”, or shortly a “node”) consists
of the CPU, the arithmetic coprocessors, the
memory, and the communication unit. The first prototypes
of this class were based on hypercube topologies
and were built at the California Institute of Technology.
Intel’s iPSC was the first commercial product,
followed by Ametek and Ncube. Recently Intel came out
with its second generation, the iPSC-2. Multiprocessor
systems with transputer nodes have also entered the
market (Meiko, Parsytec).


Class 8: Vector DM-Multiprocessors 


These systems combine the advantages of the vector and
the parallel processing concepts. The multiprocessor
architecture is derived from the class 7 machines,
whereas the node architecture is taken from low-cost
vector computers (class 2). The basic idea is to combine
powerful vector nodes having an advantageous
cost/performance ratio with a multiprocessor system. Due to
the size and the cost of a single node, their number
is, although in principle unlimited, today practically
limited to several hundreds. The computational speed of
the nodes, of course, imposes strong requirements on the
speed of the communication. If the communication
problem is solved satisfactorily, these machines are the
most powerful supercomputers existing today. Systems
currently entering the market are Suprenum, the Intel
iPSC-VX, and the Ametek 2010.

The classification of parallel computers in Figure 2 is by
no means unique and complete. An important classification
category which is not taken into account in Figure
2 is the hardware technology. Systems based on very
high-speed technology hardware (like the Cray and ETA
systems) are much more powerful (and expensive) than
systems based on microcomputer technology (like the
Alliant), although they belong to the same class.


Figure 2. Classification of parallel and supercomputer
architectures.


The Suprenum Project Organization 


The Suprenum project was conceived, during its definition
phase by the Gesellschaft fuer Mathematik und
Datenverarbeitung mbH [Company for Mathematics and
Data Processing], as a big national joint venture. From
the beginning of the project, the initiators were aware of 
the fact that only by means of a concentrated coopera- 
tion of the leading experts could Germany catch up in 
the international supercomputer developments. Users, 
mathematicians (numerical analysts), system software 
experts, and computer architects had to be brought 
together and be committed to a uniform, clear develop- 
ment goal. 


Thirteen partner institutions from industry, national
research laboratories, and universities are involved
in the Suprenum project:


• German Research and Experimental Institute for
Aeronautics and Astronautics (DFVLR);

• Dornier GmbH;

• Company for Mathematics and Data Processing
mbH;

• Nuclear research center in Juelich;

• Nuclear research center in Karlsruhe;

• Siemens AG (Power Plant Unit);

• Krupp Atlas Elektronik GmbH;

• Stollmann GmbH;

• Institute of Advanced Technology in Darmstadt;

• Brunswick Technical University;

• University of Bonn;

• University of Duesseldorf;

• University of Erlangen-Nuremberg.

The contributions of the partners are sponsored by the
Federal Ministry of Research and Technology (BMFT).


Suprenum GmbH is the fourteenth partner. It was
founded in 1986 on the initiative of the BMFT by the
main development partners Krupp Atlas Elektronik
GmbH, Stollmann GmbH, and Gesellschaft fuer Mathematik
und Datenverarbeitung mbH. It is funded by the
BMFT and the Ministry for Economy and Technology of
the State of North Rhine-Westphalia. The primary tasks
of Suprenum GmbH are:

— coordination and management of the Suprenum
project;

— integration of the hardware and software components
which are developed in the project;

— fundamental research and development;

— marketing of individual results, especially the
Suprenum systems;

— conceptual responsibility for the further Suprenum
development.


Perspectives 


With the realization of the Suprenum concept, an
attempt is made to set a standard in the promising area
of parallel processing. In order to ensure the long-term
realization of this chance, it is important to conduct a
permanent development.


The next Suprenum generation, Suprenum 2, is being
conceived. The general Suprenum philosophy is:

— to use a large number of processor nodes and

— to make each node as powerful as possible on the
basis of VLSI technology.


In order to achieve an optimum cost/performance ratio,
this will not be changed.


Generally, the trends in supercomputer developments 
are characterized by an increase of the number of nodes 
for “conventional” supercomputers like the Cray and an 
increase of performance per node for the massively 
parallel computers like the Connection Machine. Thus, a 
convergence of architectures can be expected, with
Suprenum in the middle of these trends.


The general concepts like the abstract Suprenum
machine, the programming model, etc. will be maintained;
some of them will be extended. One development
goal of the system software, for example, is, among
other things, the automatization and dynamization of
process assignment (process migration, etc.), automatic
optimization and dynamization of load balancing, etc.


The main emphasis in the field of applications software
will be placed on the development and parallelization of
new, even faster algorithms (e.g. on dynamic and
self-adaptively modifying grid structures) and further
numerical and also non-numerical application classes. Of
course, all application software that runs on the
Suprenum 1 will also be usable, with correspondingly
higher performance, on the Suprenum 2.

A long-term research and development goal in the field
of parallel computing should be to overcome the division
of the parallel world into computers with a global shared
memory and computers with distributed local memory
units.


On the hardware side, these differences will disappear if
multi-level memory hierarchies (the more local, the
faster) are used. Such developments combine both concepts
and in any case include what a forward-looking
memory technique requires.


On the software side, the concepts for the presently
pursued and future architecture lines should be standardized
so that portability of the application software is
ensured generally and not only within certain architecture
classes. Several areas which are treated in the
project (communication library, semi-automatic parallelizer)
offer natural and promising approaches for those
developments.


References 


1. Giloi, W., “Suprenum—the System,” Supercomputer.


System
$HYS0245 Amsterdam SUPERCOMPUTER in English
[Article by Wolfgang Giloi: “Suprenum—the System”]


[Text] The Suprenum supercomputing hardware consists
of a scalable number of clusters, each containing 16
vector processors with local memory for high-speed computing
and several special nodes for services within a
cluster (disk controller, diagnosis, external links).
Interprocessor communication is based on a hierarchical bus
concept: a parallel high-speed bus within a cluster and a
toroidal system of multiple serial busses between clusters.
This multiprocessor kernel is handled by a dedicated
distributed operating system (PEACE) which provides,
based on teams of lightweight processes, fast services
for message passing, resource management and all other
functionalities within a multiprocessor kernel.





The Suprenum Node Architecture 


A Suprenum supercomputer consists of up to 256 “Processing
Nodes” (PN). Each PN is a complete single-board
“vector machine” running its own operating system
PEACE and communicating with other PNs. A PN consists
of the following major resources (see Figure 1):

— node CPU (Motorola MC68020, 20 MHz) with
Paged Memory Management Unit (PMMU) (Motorola
MC68851) and Scalar Arithmetic Coprocessor
(MC68882);

— 8 Mbyte of Node Memory (DRAM, 35 nsec static
column access time);

— pipeline vector processor (IEEE double precision)
with 2 x 64 Kbyte of vector memory (SRAM, 20
nsec access time);

— DMA/Address Generator for block transfer of
data-structure objects;

— communication coprocessor for internode communication.


Figure 1. Internal structure of the Suprenum processing
node.


The node CPU performs the operating system tasks and
interprets the instructions of a program. For the sake of
secure operation, access to the code and the data objects of
the operating system and user tasks residing in the node
memory is protected by the node PMMU. As the name
suggests, the node memory is paged; however, not in the
sense of virtual memory. Rather, paging is a means of
protection and of providing fast block transfer DMA to the
data elements in a page in a static column mode of operation.
Since the PMMU adds 45 nsec to a memory access, it
is employed only on the first entry into a new page, in order
to exercise access right control, and from then on bypassed
for all the other accesses to the same page. This is feasible
since a page boundary violation would be detected “on the
fly” by special page boundary watchdog logic.
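
The effect of bypassing the PMMU after the first access to a page can be
illustrated with a rough timing model. The Python sketch below is only an
illustration under stated assumptions: the 35 nsec and 45 nsec figures are
those quoted in the text, while the page size and the function name are
hypothetical, and DRAM row changes and bus overheads are ignored.

```python
COLUMN_ACCESS_NS = 35  # static column access time of the node DRAM (from the text)
PMMU_PENALTY_NS = 45   # delay added when the PMMU checks an access (from the text)
PAGE_SIZE = 2048       # hypothetical page size in bytes; not given in the text


def access_time_ns(addresses, page_size=PAGE_SIZE):
    """Total time for a sequence of accesses when the PMMU is consulted
    only on the first access after entering a new page."""
    total = 0
    current_page = None
    for addr in addresses:
        page = addr // page_size
        if page != current_page:
            total += PMMU_PENALTY_NS  # access right check on page entry
            current_page = page
        total += COLUMN_ACCESS_NS     # every access pays the memory access time
    return total


# 256 consecutive 8-byte words inside one page: one check, 256 accesses.
block = range(0, 2048, 8)
print(access_time_ns(block))
print(len(block) * (COLUMN_ACCESS_NS + PMMU_PENALTY_NS))  # PMMU on every access
```

Under these assumptions a block transfer within one page costs less than
half of what it would cost if every access went through the PMMU.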


The Pipeline Vector Processor (PVP) uses the Weitek
WTL2264/2265 chip set in connection with a microcoded
controller accommodated in one of the ASICs. At
20 MHz clock frequency, the Pipeline Vector Processor
has a peak performance of 10 Mflop/s for single
operations (IEEE standard double precision) and 20
Mflop/s for chained operations (e.g. vector dot
product). The Vector Memory (VM) ensures a sustained
performance close to the peak performance. The PVP
also performs the common scalar floating-point arithmetic
operations. The Scalar Arithmetic Coprocessor
(SAC) (MC68882) provides additional floating-point
functionality such as conversion and trigonometric and
transcendental functions. The DMA/Address Generator (MAP)
allows for a high-speed block transfer of data-structure objects
(DSO):


i. between Pipeline Vector Processor and Vector
Memory,


ii. between Vector Cache and Node Memory, and


iii. between Node Memories of different nodes.


Its microcoded address generators support all required 
access functions for the data-structure types “vector” 
and “matrix” [2, 3]. The MAP functions are performed
by an ASIC.


The Communication Coprocessor (CC) performs the
functions of formatting, sending, and receiving of messages
by hardware in the microsecond range. It is realized
by an ASIC.


All four coprocessors utilize the same uniform coprocessor
interface of the MC68020, whose functionality
has been expanded by a special coprocessor interface
ASIC. In addition to the ASICs already mentioned, there
are also ASICs for other functions such as memory error
detection and correction (EDC), address decoding, data
path multiplexing and bus protocol handling. All ASICs
are realized by CMOS gate arrays from LSI Logic Inc.


The node memory utilizes DRAM SMDs with 1 Mbit
capacity each, mounted on SIPs. The various static
memories (Vector Memory, microprogram control 
stores) as well as the CMOS bus drivers are packaged as 
hybrid modules. 


The Suprenum Cluster 


A Suprenum cluster consists of 20 nodes accommodated
in one 19 inch rack; 16 nodes are the “Processing Nodes”
as described in the preceding section. The other 3 nodes
are: the Cluster Disk Controller Node (DCN), the Inter
Cluster Communication Node (CCN), and the Cluster
Diagnosis Node (CDN).


The nodes of a cluster communicate via the cluster bus
system, which consists of 2 message switching parallel
buses with 64 data lines each. Doubling the cluster bus
including its controller logic renders the cluster bus
system fault tolerant and, by the same token, doubles the
interconnection bandwidth in the cluster to a total of 320
Mbytes per second (arbitrated). Several nodes can
communicate simultaneously via the cluster bus system.
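
The bandwidth figure can be broken down per bus with a little arithmetic.
The Python sketch below uses only the numbers given in the text (two buses,
64 data lines each, 320 Mbytes per second in total); the variable names are
illustrative, and the transfer rate it derives is an inference, since the
text does not state the bus clock.

```python
TOTAL_MBYTE_PER_S = 320  # total cluster bus bandwidth (from the text)
BUSES = 2                # the cluster bus is doubled (from the text)
DATA_LINES = 64          # data lines per bus (from the text)

per_bus_mbyte_per_s = TOTAL_MBYTE_PER_S / BUSES
bytes_per_transfer = DATA_LINES // 8
# Millions of 64-bit transfers per second that each bus must sustain.
mega_transfers_per_s = per_bus_mbyte_per_s / bytes_per_transfer

print(per_bus_mbyte_per_s, bytes_per_transfer, mega_transfers_per_s)
```

Each bus would thus carry 160 Mbytes per second, or 20 million 64-bit
transfers per second, which happens to match the 20 MHz clock quoted for
the node.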


The remaining question of how to interconnect the clusters
is discussed in the following section, which presents the
rationale for the rather unique interconnection structure
of Suprenum.


The Cluster Interconnection Structure 


The interconnection structure of a high-performance 
MIMD/SIMD system with a large number of nodes must 
be blocking-free in order to avoid performance
degradation.


For many computer architects, the “classical” answer to
the inter-node communication problem is the use of a
multi-stage interconnection network (IN), realized in the
form of either a circuit switching network, where a
physical connection is provided directly from the source
node to the destination node, or a packet switching
network, where a logical connection is provided between
source node and destination node.


There exists a large variety of IN structures, and many
papers have been published dealing with the interconnection
properties of the various network types. However,
few papers address the issues of technical feasibility,
obtainable interconnection bandwidth, packaging
problems including the severe pin limitation problem
one may be running into, driving power limitation, cost,
and other mundane technical problems [4]. Here we
briefly discuss the dichotomy between the solution that
exhibits ideal interconnection properties at the cost of an
unfavorable (N²)-complexity, the crossbar switch
network, and on the other hand the favorable (N log
N)-complex networks which have unfavorable interconnection
properties.


The network type that provides total point-to-point
connectivity without the danger of blocking is the
crossbar switch. However, the quadratic complexity of
the crossbar solution limits its size for reasons of pin
limitation and packaging complexity to 32 x 32 or 64 x
64 at most. The network types that provide an optimal
trade-off between interconnection properties and circuit
complexity are the (N log N)-complex networks.


There exists a way out of the dilemma of the technical
non-feasibility of large crossbar switch networks and the
unfavourable interconnection behavior of the INs with
(N log N)-complexity. The solution is a two-stage
approach in which either a crossbar network or an
equally fast packet switching network is used to
interconnect clusters of nodes rather than the nodes
themselves. Consequently, instead of having to interconnect a
large number of nodes, only a much smaller number of
clusters needs to be interconnected.


Thus, the first stage of the two-stage interconnection
network consists of the intra-cluster interconnection
structures, while the second stage is formed by the
inter-cluster interconnection structure. The advantage of
this solution is that as long as the cluster size is kept
sufficiently small there exists an extremely fast and
economical solution for the intra-cluster interconnection
structure, given in the form of the common parallel bus.
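
The saving achieved by interconnecting clusters instead of nodes can be
made concrete by comparing the growth terms mentioned above. The Python
sketch below is a rough illustration only: it evaluates N² and N log N
(taken here as the base-2 logarithm, an assumption) for 256 nodes and for
16 clusters, ignoring all constant factors and packaging details. The
function names are hypothetical.

```python
import math


def crossbar_cost(n):
    """Relative complexity of an n x n crossbar switch (grows as n squared)."""
    return n * n


def multistage_cost(n):
    """Relative complexity of an (N log N)-type network, base-2 log assumed."""
    return n * math.log2(n)


nodes, clusters = 256, 16

print(crossbar_cost(nodes))     # crossbar over all nodes
print(multistage_cost(nodes))   # multi-stage network over all nodes
print(crossbar_cost(clusters))  # crossbar over clusters only
```

With these simplifications, a crossbar over the 16 clusters is smaller by a
factor of 256 than one over all 256 nodes, and also smaller than an
(N log N)-type network over all nodes.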

Parallel buses can be made rather wide, e.g. 64 data bits,
a measure that alone already guarantees a high
interconnection bandwidth. In addition, as long as a parallel bus
is kept short, it can be made quite fast. A parallel bus can
be made fault tolerant, e.g. by adding a number of
redundant bus lines. However, if one wants to combine
fault tolerance with the highest possible interconnection
bandwidth, the better approach is to simply double the
parallel bus and have a bus arbiter that allocates to a
requesting node either one of the two buses, whichever is
free next. In order to keep the length of the cluster buses
sufficiently short, we restricted the number of circuit
boards to a maximum of 20, assuming 20 mm spacing.
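
The allocation rule for the doubled bus can be sketched in a few lines. The
Python class below is a hypothetical software model, not a description of
the actual arbiter hardware: it grants a requesting node whichever bus is
free, and a bus marked as failed is simply never granted, which is how the
doubling yields fault tolerance in this model.

```python
FAILED = object()  # sentinel marking a bus that is out of service


class DualBusArbiter:
    """Toy model of an arbiter for a doubled parallel bus (names are illustrative)."""

    def __init__(self, buses=2):
        self.owner = [None] * buses  # None means the bus is free

    def request(self, node):
        """Grant the first free bus to the node; return None if none is free."""
        for bus, owner in enumerate(self.owner):
            if owner is None:
                self.owner[bus] = node
                return bus
        return None  # all buses busy or failed; the node has to retry

    def release(self, bus):
        """Free a bus after a transfer, unless it has been marked as failed."""
        if self.owner[bus] is not FAILED:
            self.owner[bus] = None

    def mark_failed(self, bus):
        """Take a bus out of service; traffic continues on the remaining bus."""
        self.owner[bus] = FAILED


arbiter = DualBusArbiter()
first = arbiter.request("PN0")   # gets bus 0
second = arbiter.request("PN1")  # gets bus 1, transfers run simultaneously
third = arbiter.request("PN2")   # no bus free yet
```

With both buses working, two transfers proceed at once; with one bus
failed, the cluster keeps running at half the bandwidth.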


In the Suprenum supercomputer, the clusters are
interconnected via the torus structure. The torus structure is
formed by a matrix of bit-serial ring buses which
transmit data at a rate of 2 x 125 Mbits per second on the
basis of the token-ring protocol. The net data rate, which
the clusters of a torus must share, is about 20 Mbyte/s.
This is the reason why not more than 4 clusters are
inserted in each ring, so that there remains enough
interconnection bandwidth per cluster. Doubling the
torus structure by having row rings and column rings not
only doubles the interconnection bandwidth but also
renders the structure fault tolerant: should a ring fail,
there is still the possibility of reaching the clusters of the
ring through alternative routing. Alternative routing is
provided by the CCN of each cluster (one of the specialized
nodes).
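
Alternative routing over row and column rings can be illustrated with a
small search. The Python sketch below assumes a 4 x 4 arrangement of the 16
clusters (consistent with at most 4 clusters per ring, but not stated
explicitly in the text) and models each cluster as sitting on one row ring
and one column ring. The function and the ring naming are hypothetical and
do not describe the actual CCN routing algorithm.

```python
from collections import deque

SIZE = 4  # assumed 4 x 4 matrix: 16 clusters, 4 clusters per ring


def route(src, dst, failed_rings=frozenset()):
    """Breadth-first search for a shortest path of clusters from src to dst.

    Clusters are (row, column) pairs. A cluster can reach every other
    cluster on its row ring ("row", r) or column ring ("col", c), provided
    that ring is not listed in failed_rings."""
    queue = deque([[src]])
    seen = {src}
    while queue:
        path = queue.popleft()
        r, c = path[-1]
        if (r, c) == dst:
            return path
        neighbours = []
        if ("row", r) not in failed_rings:
            neighbours += [(r, k) for k in range(SIZE)]
        if ("col", c) not in failed_rings:
            neighbours += [(k, c) for k in range(SIZE)]
        for nxt in neighbours:
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None  # destination unreachable with the given failures


direct = route((0, 0), (0, 3))
detour = route((0, 0), (0, 3), failed_rings={("row", 0)})
```

In this model two clusters on the same row ring are normally one ring hop
apart; if that row ring fails, the message detours over a column ring,
another row ring, and a second column ring.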


The Node Operating System (PEACE) 


Suprenum has been designed as a message-based, loosely
coupled system, the rationale for this design decision
being twofold. Firstly, “hot spot contentions” that may
easily arise in memory sharing systems are avoided.
Secondly, a high degree of fault tolerance, and thus
availability, had to be designed into the system.
Designing Suprenum as a fault tolerant architecture
implies all the characteristics of a distributed system.

Therefore, centralized resources, including a centralized
“global operating system”, had to be avoided.
Consequently, Suprenum has a local “node operating
system” in each node, while a global operating system
exists only virtually, its functions being performed in
reality by the collective of node operating systems.


Major tasks of the node operating system are:

— local resources management, including access right
control to memory;

— local process management;

— interprocess communication.


In performing its tasks, the node operating system may
request services from other node operating systems; by
the same token, it must be willing to provide services for
other node operating systems.


There are strong reasons for designing the node operating
system as a multitasking operating system. At the
system level, a highly modularized design of the operating
system enhances its efficiency, security, and fast
implementation. At the user level, multitasking is a
prerequisite for constructing application software
independent of the specific configuration of the machine, i.e.,
the number of the nodes: the application program is
partitioned into a number of cooperating processes
which are then distributed over the number of nodes of
a particular configuration.
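
How a fixed set of cooperating processes can be spread over configurations
of different size may be sketched as follows. The Python function below
uses a plain round-robin placement, which is an assumption made purely for
illustration; the text does not say which mapping strategy Suprenum uses,
and the function name is hypothetical.

```python
def distribute(num_processes, num_nodes):
    """Assign process numbers to nodes round-robin.

    Returns a dict mapping each node number to the list of process
    numbers placed on it."""
    placement = {node: [] for node in range(num_nodes)}
    for pid in range(num_processes):
        placement[pid % num_nodes].append(pid)
    return placement


# The same program, partitioned into 64 processes, on two configurations.
small = distribute(64, 16)   # one cluster: several processes per node
large = distribute(64, 256)  # full machine: at most one process per node
```

Because each node multitasks, the same partitioned program runs unchanged
on both configurations; only the number of processes per node differs.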


PEACE (Process Execution And Communication Environment)
[5] is the Suprenum node operating system
especially designed to meet the requirements outlined
above. Specifically, PEACE supports the following
features:

— remote access of resources (files, devices) in other
nodes;

— remote monitoring of system components in other 
nodes: 

— dynamic reconfigurability of the system after the 
detection of faults and dynamic reconfiguration for 
load balancing and service migration in user pro- 
grams. 


The architecture of PEACE is based on the team concept,
resulting in a highly modularized, hierarchically
structured system. Means of structuring in PEACE are
processes and teams.


Processes are lightweight processes representing system
components that render services to other such components;
they are subject and object of access rights, and
they readily allow the construction of dynamically
reconfigurable systems. A team is a group of lightweight
processes that share common access domains to intrinsic
system objects such as files, memory segments, and
processes.


A process requests a service from a remote server process
by issuing a remote procedure call (RPC) message.
Message passing is based upon a synchronous communication
mechanism of maximal efficiency.
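
The pattern of a synchronous remote procedure call over message passing
can be illustrated with threads and queues. The Python sketch below is a
generic model of the idea only: all names are hypothetical, it is not the
PEACE interface, and a thread within one program stands in for a server
process on another node.

```python
import queue
import threading


class Server:
    """A server process with an inbox; each request carries its own reply box."""

    def __init__(self, handler):
        self.inbox = queue.Queue()
        self.handler = handler
        threading.Thread(target=self._serve, daemon=True).start()

    def _serve(self):
        while True:
            request, reply_box = self.inbox.get()  # receive a request message
            reply_box.put(self.handler(request))   # send the reply message


def rpc(server, request):
    """Send a request message and block until the reply arrives (synchronous)."""
    reply_box = queue.Queue(maxsize=1)
    server.inbox.put((request, reply_box))
    return reply_box.get()


doubler = Server(lambda value: value * 2)
result = rpc(doubler, 21)
```

The caller is suspended inside `rpc` until the server has replied, which is
what makes the exchange synchronous: no buffering of outstanding requests
is needed on the client side.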


PEACE is hierarchically structured; its core consists of:

— PEACE kernel;

— process server;

— name server;

— memory server;

— team server.


Functions of the PEACE kernel are:

— interprocess communication (supported by a specific
communication coprocessor);

— process and address space switches;

— propagation of traps and interrupts as messages;

— message routing (send, reply).


Functions of the name server are the issuing and monitoring
of name spaces and service access points (SAP).
Functions of the process server are the issuing of unique
process identifiers (PID) and the dynamic process
administration. Functions of the memory server are the
issuing of segment identifiers and the dynamic management
of memory objects. The team server handles a
variety of specialized teams that function as administrators
such as: name administrator, team administrator,
memory administrator, panic administrator, signal
administrator, clock administrator, device administrator,
and file administrator, the names of the administrators
indicating their role in the system. Each administrator
team usually encompasses several server
processes (e.g. the memory administrator comprises the
memory server and the MMU trap server, etc.).


PEACE was designed in MODULA-2 and was
rewritten in C for performance reasons. PEACE has been
optimized and fine-tuned to render its basic functions as
fast as possible. At present, PEACE is believed to be the
fastest message-passing operating system in existence.

[Article by Karl Heinz Werner, Ulrich Brass, and Ernst
Thomas: “The Suprenum User Interface”]


[Text] From the front end, the Suprenum multiprocessor,
along with all its resources and file system, is accessed
by the user simply as a common Unix device. Servers like
the job manager, the Suprenum kernel manager, and a
file server system can be invoked by Unix commands
and take care of mapping and execution of jobs on a
requested partition of the kernel, of file management
and access rights, job security, and I/O. The user interface
is implemented on top of PEACE and Unix.


Very early in the conceptual phase of the Suprenum
project it was clear that the global architecture of the
system would be divided into the host system (one or more
Unix machines) and the Suprenum kernel (sk), consisting
of several independent nodes organized in clusters
and performing high-speed numerical programs.


An interesting question is how to use a high-performance
numerical computing device such as the Suprenum
kernel within a Unix system from a user's point of view.
A Suprenum system is expected to run in different
environments: computing centers, university institutes,
industrial laboratories, and others.


The answer to this question is influenced by the
main use of the system as a numerical supercomputer,
specific decisions in the architecture of the system (cluster
structures, distributed disk system, etc.), properties of
both operating systems (Unix on the host and PEACE [1]
on the Suprenum kernel), the abstract programming
model, and the user expectations.


The programming model is based on independent tasks
exchanging messages. There is always an initial task and
a set of node tasks. Usually the node tasks are based on the
same program, but with different data. The communication
model is asynchronous. This means, for instance, that a
“send” operation completes without explicit blocking
and without an explicit acknowledgement.


The PEACE node operating system provides lightweight
processes, organized into teams, a variable number of
teams on a node, a rendezvous mechanism for interprocess
communication, hardware-supported high-volume
data transfer, and a remote procedure call mechanism
based on interprocess communication as basic primitives.
Remote procedure call is embedded in a distributed
name space concept. The various name spaces are
connected to each other like directories in the file
system. PEACE is optimized for fast process switches
and fast network-wide communication ([1,2]).


A task from the programming model is mapped to a
PEACE team consisting of the application process, a
mailbox server, and some other servers like a name server
and signal server. During the design process of the user
interface, questions came up both in connection with the
systems and from the user's point of view.


Typical questions were:

— In which way should jobs be executed in the sk, and what
is an acceptable granularity with respect to the jobs/node
ratio?

— Host and node operating systems have to be connected.
How should this be done?

— In which way are file accesses handled in the distributed
system?

— How can user identification and other security mechanisms
be extended to the Suprenum kernel?

— What is a good collection of tools for a system
administrator to manage the system with respect to
disk usage, handling of user jobs, and so on?


Since Suprenum is a distributed project, different project
teams worked together in order to solve these and related
questions. A common forum, the “Suprenum user interface
circle”, was established in the spring of 1988. In this
paper we collect the design decisions reached by the
“Suprenum user interface circle” and experiences from
the implementation of prototype systems. Currently, the
major design decisions have been made and the implementation
process is going on. There is still no experience with a
complete system.


Basic Design Decisions

The basic design decision is as follows:


• The combined system is represented to the user as a
homogeneous Unix system, with the Suprenum kernel as
a specific (high-performance computing) device.


Modifications to the underlying Unix system incur
costs whenever new releases are adopted. Also, there exist different
Unix systems (System V, Berkeley Unix, etc.). For
Suprenum, the target host system is a System V (V3)
machine (the MPR 2300), but in the process of development,
other Unix machines were used.


• Search for portable solutions with respect to the software
running on the host.


As a practical consequence, we try to minimize the
number of software components that run on the host
system. Most servers are prepared to run in the
Suprenum kernel. There is a distributed disk system
consisting of the host disks and the cluster disks. On each
cluster disk there is a Unix file system. Clearly, the
expectation is:


• There is a unique logical file system.


The different cluster file systems are mounted in the host
file system. A typical path could be:


sk/cluS/user/joe/pdesolver/euler/data44 


Access to files in the Suprenum kernel from the host
system is permitted through a command set including the
usual file handling commands ([3]). The Suprenum
system should run efficiently in different environments.

Since we were not able to predict what would be the best
way of system management for each possible environment,
the decision was to:


• Provide tools, so that the system can be configured to
specific needs.

Last but not least, the following principle was adopted:


• Increase overall system throughput.


Single/Multi-Tasking System


The decision as to which type of single/multi-tasking system
should be used is difficult. A typical Unix user would
expect a timesharing system for the Suprenum kernel.
Technically this is possible because the node
operating system supports multi-tasking. On the other
hand, supercomputer users tend to expect a batch system
for performance reasons.


In a parallel system, one can take a middle course by
partitioning the set of nodes among several users. What is
the smallest unit available for a job?


In the Suprenum context, this can be a single computing
node or a cluster. Clearly, a single node allows a more
flexible handling of user requests, but the management
overhead also increases. Currently, the smallest unit
available for a user job is a cluster.


If a partition is assigned to a job A, then no other job is
allowed to use the computing nodes of this partition
during execution time of job A. On the other hand, a user
working on the host system is allowed to move data in
and out of the clusters which are assigned to job A. Also,
all types of system processes may run in the partition
assigned to user A.


This decision increases overall throughput at the cost of a (usually)
minor decrease of system performance for a single job.


Job Management 


User jobs on the Suprenum kernel are started on the host
by the skx command ([4]). By using options to this
command, the user can request a certain number of
clusters and other resources.


The skx part runs under Unix; it interprets the command
line, initializes data structures for the job spooler, reads
the (Unix) environment, and controls the usage of files.
Then, the skx command sends the job request to the job
spooler.


The job spooler controls a FIFO queue of sk requests.
There is a server called sk-manager ([5]). This component
manages the Suprenum kernel in terms of actually
available clusters, nodes, and communication paths. By
requesting information from the sk manager, the job
spooler obtains information on which parts of the
Suprenum kernel are available for job execution. If there
is a job waiting and the resources are available, the
spooler forks itself and starts job execution.
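
The interplay of job spooler and sk manager described above can be sketched roughly as follows. This is a minimal Python model, not Suprenum code: the class names `SkManager` and `JobSpooler`, their methods, and the cluster counts are all hypothetical, and the strict rule that a waiting job at the head of the queue is never overtaken is an assumption (the text only states that the queue is FIFO).

```python
from collections import deque


class SkManager:
    """Hypothetical stand-in for the sk manager: tracks free clusters."""

    def __init__(self, n_clusters):
        self.free = set(range(n_clusters))

    def allocate(self, count):
        """Return a partition of `count` clusters, or None if too few are free."""
        if count > len(self.free):
            return None
        partition = sorted(self.free)[:count]
        self.free.difference_update(partition)
        return partition

    def release(self, partition):
        self.free.update(partition)


class JobSpooler:
    """Hypothetical FIFO spooler: only the job at the head may start."""

    def __init__(self, manager):
        self.manager = manager
        self.queue = deque()   # waiting jobs as (name, clusters_requested)
        self.active = {}       # job name -> partition (list of cluster ids)

    def submit(self, name, clusters):
        self.queue.append((name, clusters))

    def dispatch(self):
        """Start waiting jobs in FIFO order while resources suffice."""
        started = []
        while self.queue:
            name, clusters = self.queue[0]
            partition = self.manager.allocate(clusters)
            if partition is None:
                break          # head of queue must wait (assumed: no overtaking)
            self.queue.popleft()
            self.active[name] = partition
            started.append(name)
        return started

    def finish(self, name):
        self.manager.release(self.active.pop(name))


spooler = JobSpooler(SkManager(n_clusters=4))
spooler.submit("A", 3)
spooler.submit("B", 2)
spooler.submit("C", 1)
first = spooler.dispatch()    # only A fits; B blocks the queue, so C waits too
spooler.finish("A")
second = spooler.dispatch()   # after A releases its partition, B and C start
```

Because a partition is removed from the free set while its job is active, no two jobs ever share a cluster, which mirrors the exclusive-partition rule of the preceding section.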


A job can be in one of three states: active, waiting, or
frozen. Frozen means that a distributed job is stopped
and swapped out of the Suprenum kernel. There is a
package of tools allowing a single user to manage the jobs
he owns. The system administrator has his own access path
to the configuration of the job manager. In this way, it is
possible to constrain resources of the Suprenum kernel
for specific users.


Access Rights


The usual identification and access mechanisms are
extended to the Suprenum kernel. Files on cluster disks
are owned by a specific user; this yields read, write, and
execution rights for the owner, the group, and others. When
starting a job, the job executer initializes the name space
of the job. There is a security server for this job, which is
connected to the job name space. A file server is also
included in the name space. The file server may issue a
getuid remote procedure call, which is answered by the
security server with the user identification.


Environment 


The Unix environment is read and sk-specific data are
added by the skx command. For each task this environment
is at hand. By routines like getenv and putenv, the
environment can be read into user programs, manipulated,
and put back.


Finding Files in the Suprenum Kernel


It may occur that initially a task with file I/O will run on,
say, cluster 1. Then, the second time, the task will run
on, say, cluster 2. Now the problem is how to find the old
file(s) from the first run. Our solution to this question is
that logically a user is always at the same place in the file
system during task execution, i.e., the current user directory
of the host.


All files in the Suprenum kernel, newly created or
modified during task execution, are mapped by symbolic
links in the current user directory to their full path names.
This is self-documenting and enables the file server
system to find the files. Through the environment mechanism,
a user can set a local flag, which makes the local file
server responsible for moving remote data to the local
disk ([3,6]).


Connecting the Host and the Suprenum Kernel 


Physically, there are special VME boards that fit into the
host system. This system is called CAC and it consists of
a CAC/processor board and a CAC/Suprenumbus board.
The latter connects the different clusters with the host
system via the Suprenumbus, a 125-Mbit/s serial link.
On the CAC/processor a special version of the PEACE
operating system runs (with special device drivers).
Logically, parts of the PEACE environment are emulated
in the Unix environment. This allows the initial task to
run on the host system. On the other hand, the initial
task can run in the Suprenum kernel. Then, only a few
servers run in the Unix environment, providing for file
access, graphics support, and so on, while most of the
servers run in the Suprenum kernel. This is faster, but
limitations like a fixed amount of memory on a node in
the Suprenum kernel become important.


Implementation of the User Interface 


A system of servers is used to realize the user interface as
described above. At the outer level there are sk and job
manager for execution of jobs, the file server system

mapping library (see [1]). NEWTASK returns a TASKID
(see below) value or an array of those, respectively.
Analogously to the EXTERNAL statement, the
progname has to be specified in a TASK EXTERNAL
statement if it is passed as argument, e.g. in NEWTASK.


Process termination. A process is terminated if it termi- 
nates itself (STOP or END-statement), or if the initial 
process is stopped. Termination of the creating father 
process does not terminate the child process. 


TASKID data type. Each generated process is associated
with a 32-bit identifier of data type TASKID. Variables
of that data type are used as addresses in SEND/RECEIVE
statements (see below) and they can only be
defined by a NEWTASK function call. TASKID variables
can be passed to subroutines and may occur in
COMMON blocks if they are not mixed with other data
types. They must not occur in formatted I/O statements.
The null value of this data type is the constant
.NOTASKID.


Message passing. The basic message passing primitives
are SEND and RECEIVE. The syntax is


SEND ([TASKID=] pid, [TAG=] tag) iolist


The brackets [ ] indicate optional keywords. pid is the
process identifier of the addressed process. Multicast
and broadcast are possible by specifying a set of process
identifiers or ALL, respectively. The tag is an additional
piece of information (an integer number) which is associated with the
message. In addition to the address and the tag, the user
can specify an error label and a status variable.


The iolist is a list containing expressions, implied do- 
loops (similar to those in READ and WRITE state- 
ments), arrays, and array sections (see below). 


The syntax of the RECEIVE statement is similar:


RECEIVE ([TASKID=] pid, [TAG=] tag,
SENDER= sender) iolist


The parameters are similar to those in the SEND statement.
The TAG specification is obligatory. The optional
“SENDER=” specification returns the process identifier
of the sending process, which might be useful to know if
the “TASKID=” specifier has been omitted. Error labels
and status variables can also be specified.


The Suprenum message passing model is asynchronous:

— The sending process continues immediately after
execution of the SEND and can overwrite data in the
iolist. The sending process does not wait for the
execution of the corresponding RECEIVE by the
receiving process.

— Each process has a mailbox which contains all messages
sent to this process which have not been
RECEIVEd yet. If a process wants to receive a
message with a specified tag and (optionally) a specified
sender and no message with matching data is in
the mailbox, the process blocks until a matching
message has arrived in the mailbox.

— Messages do not preserve temporal order. Messages
from the same sender can be distinguished only by the
tag.
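
The mailbox semantics listed above can be modelled in a few lines. The following is a minimal Python sketch, not the Suprenum implementation: the class `Mailbox`, its method names, and the `timeout` argument are hypothetical. It shows a non-blocking send, a receive that matches on tag and optional sender rather than on arrival order, and a receive that blocks until a matching message arrives.

```python
import threading


class Mailbox:
    """Toy model of a per-process mailbox with tag/sender matching."""

    def __init__(self):
        self._messages = []                 # list of (sender, tag, data)
        self._arrived = threading.Condition()

    def send(self, sender, tag, data):
        """Deposit a message and return at once; the sender never waits.

        A real SEND copies the iolist so the sender may overwrite it;
        this sketch simply stores the object it is given.
        """
        with self._arrived:
            self._messages.append((sender, tag, data))
            self._arrived.notify_all()

    def receive(self, tag, sender=None, timeout=None):
        """Block until a message with `tag` (and `sender`, if given) is present."""
        def find():
            for i, (s, t, _) in enumerate(self._messages):
                if t == tag and (sender is None or s == sender):
                    return i
            return None

        with self._arrived:
            if not self._arrived.wait_for(lambda: find() is not None, timeout):
                raise TimeoutError("no matching message")
            return self._messages.pop(find())


box = Mailbox()
box.send(sender=7, tag=12, data="solution")
box.send(sender=3, tag=11, data="start")
# Matching is by tag, not arrival order: tag 11 is taken first.
first = box.receive(tag=11)
second = box.receive(tag=12, sender=7)

# A receive posted before the matching send blocks until the message arrives.
result = []
t = threading.Thread(target=lambda: result.append(box.receive(tag=13, timeout=5)))
t.start()
box.send(sender=1, tag=13, data="late")
t.join()
```

Since delivery order is not guaranteed, a program that sends several messages between the same pair of processes has to give them distinct tags.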


Selective receive. In order to avoid possible blocking of a
receiving process, several expected process identifier and
tag combinations can be specified in the WAIT statement:


WAIT ([TAG=] tag, [TASKID=] pid, COND= condition,
LABEL= label_1, ...)
CONTINUE= label_n


The tag/pid/cond/label combination may be repeated
and several labels label_1, ..., label_n may be specified. If a
message is in the mailbox which matches one of the
tag/pid combinations and if the corresponding condition
is true, the program branches to the associated label_i. If
none of the tag/pid/cond combinations is fulfilled, the
program continues execution at the label given after CONTINUE.
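
The branching rule of the selective receive can be illustrated with a small Python sketch. The function `wait`, its argument layout, and the label strings are hypothetical; in particular, testing the alternatives in the order written is an assumption, since the text does not say which alternative wins when several match.

```python
def wait(mailbox, alternatives, otherwise):
    """Return the label of the first alternative that matches, else `otherwise`.

    mailbox      -- list of (sender, tag) pairs for messages not yet received
    alternatives -- list of (tag, sender_or_None, condition, label)
    """
    for tag, sender, condition, label in alternatives:
        matched = any(
            t == tag and (sender is None or s == sender) for s, t in mailbox
        )
        if matched and condition:
            return label
    return otherwise


mailbox = [(4, 101), (9, 102)]   # pending messages as (sender, tag)
iteration_done = False

label = wait(
    mailbox,
    [
        (103, None, True, "HANDLE_103"),          # no such tag pending
        (102, 9, iteration_done, "HANDLE_102"),   # matches, but condition false
        (101, None, True, "HANDLE_101"),          # matches and condition true
    ],
    otherwise="CONTINUE_HERE",
)
empty = wait([], [(101, None, True, "HANDLE_101")], otherwise="CONTINUE_HERE")
```

The call never blocks: with an empty mailbox it falls through to the continuation label, which is the point of the construct.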


Inquiry functions. The following inquiry functions are
provided:

— MYTASKID () gives the process identifier of the
calling process;

— MASTER () gives the process identifier of the initial
process;

— TESTTAG ([TAG=] tag) is a logical function which is
.TRUE. if a message with the specified tag is in the
mailbox, otherwise it returns .FALSE.;

— TESTMSG ([TAG=] tag, [TASKID=] pid) is a logical
function which is .TRUE. if a message with the
specified tag and pid is in the mailbox, otherwise it
returns .FALSE.


TESTTAG and TESTMSG can be used to avoid
blocking processes.


SIMD Extensions


SIMD vector processing within each process (and node) 
is supported by the array constructs as they are part of 
the new Fortran 8x standard [6], meanwhile also opti- 
mistically called Fortran 88 [7]. Although the standard 
has not been accepted finally by the responsible ANSI 
and ISO groups, some essential parts of the new Fortran 
including most of the array processing constructs can be 
regarded as stable. The formulation of an application
program in terms of arrays and vectors instead of in terms
of (nested) loops is an important advance in programming
vector computers. Using the new array notation,
programming becomes more “object-oriented”, the 
codes look clearer, and automatic vectorizers are more 
or less superfluous since in the array notation vectors are 
expressed explicitly. 


In the following, only some important features of the 
array notation will be mentioned (see [8] for a detailed 
explanation). 


Array properties. An array is a named set of contiguously
stored data entities. All elements of an array must be of
the same data type. Subsets of arrays are called array
sections. Each array has a data type, a rank (the number
of dimensions, less than or equal to 7), a size (total
number of elements), and a shape (defined by the rank
and number of elements in each dimension). Depending
on when they are defined, arrays can be declared as
explicit-shape, assumed-shape, assumed-size, or
deferred-shape arrays.


Array subscripts. An array or an array section is referenced
with one or more subscripts in a subscript list. The
subscripts may be triplets (lower bound : upper bound :
increment) or vector subscripts. Example:


REAL, ARRAY (20) :: A, B, C            ! new declaration statements
INTEGER IND(10)
IND = [4,3,2,1,9,10,8,7,6,5]           ! array constructor
A(2:20:2) = B(IND) + C(10:1:-1)        ! triplets and vector subscripts
                                       ! A(2)=B(4)+C(10),
                                       ! A(4)=B(3)+C(9), ...
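
To make the element pairing in the example above concrete, the following Python sketch spells it out with plain lists. It is only an illustration under stated assumptions: the helper `triplet`, the leading unused slot that emulates 1-based indexing, and the sample contents of B and C are all introduced here and are not part of Suprenum-Fortran.

```python
# 1-based Fortran arrays are modelled with a leading unused slot at index 0.
N = 20
B = [None] + [float(i) for i in range(1, N + 1)]          # B(i) = i
C = [None] + [100.0 * i for i in range(1, N + 1)]         # C(i) = 100*i
A = [None] + [0.0] * N
IND = [4, 3, 2, 1, 9, 10, 8, 7, 6, 5]


def triplet(lower, upper, step):
    """Indices selected by the Fortran triplet lower:upper:step (inclusive)."""
    stop = upper + (1 if step > 0 else -1)
    return list(range(lower, stop, step))


lhs = triplet(2, 20, 2)        # 2, 4, ..., 20
rhs_c = triplet(10, 1, -1)     # 10, 9, ..., 1

# Evaluate the whole right-hand side first, then assign element by element.
values = [B[i] + C[j] for i, j in zip(IND, rhs_c)]
for target, value in zip(lhs, values):
    A[target] = value
```

All three index lists have ten entries, which is what makes the array statement conformable; odd-numbered elements of A are left untouched.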


Dynamical allocation. Arrays can be declared without
specified bounds. Execution of an ALLOCATE statement
specifies the bounds and makes the array definable.
Dynamic allocation can be used, e.g., in parallel
matrix or grid applications when the size of the submatrices
or subgrids depends on the number of processes.
This number is usually an input parameter and not
known at compile time. Example:


REAL, ARRAY, ALLOCATABLE (:,:) :: GRID      ! deferred shape declaration
RECEIVE (...) NPROCX, NPROCY                ! receive process configuration
ALLOCATE GRID (0:NX/NPROCX, 0:NY/NPROCY)    ! dynamical allocation
DEALLOCATE (GRID)                           ! deallocation
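
The size arithmetic behind such an allocation can be checked with a short Python sketch. The function name `allocate_subgrid` and the sample sizes are hypothetical; the sketch assumes Fortran integer division for NX/NPROCX and inclusive bounds starting at 0.

```python
def allocate_subgrid(nx, ny, nprocx, nprocy):
    """Model of ALLOCATE GRID(0:NX/NPROCX, 0:NY/NPROCY) with integer division."""
    rows = nx // nprocx + 1      # bounds 0..NX/NPROCX are inclusive
    cols = ny // nprocy + 1
    return [[0.0] * cols for _ in range(rows)]


grid = allocate_subgrid(nx=64, ny=32, nprocx=4, nprocy=2)
shape = (len(grid), len(grid[0]))
del grid   # counterpart of DEALLOCATE
```

The same program text thus serves any process configuration, because the subgrid extent is computed only after the configuration has been received.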


Assignments. Before an assignment variable=expression
is made, the expression is evaluated completely.
Example:


A(1:10) = A(10:1:-1)      ! reverses elements of array A
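
Why complete evaluation of the right-hand side matters can be seen by comparing it with a naive element loop. The Python sketch below is an illustration only; the variable names are hypothetical.

```python
a = list(range(1, 11))

# Array-assignment semantics: evaluate the right-hand side completely...
rhs = a[::-1]
# ...then store it.
fortran_like = a[:]
fortran_like[0:10] = rhs

# A naive element loop that reads and writes the same array overlaps itself:
# from the middle onwards it reads elements it has already overwritten.
naive = a[:]
for i in range(10):
    naive[i] = naive[9 - i]
```

Only the first variant yields the reversed array; the loop produces a palindrome because the second half copies values the first half has already replaced.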


Conditional operations. Array assignments can be
masked using the WHERE statement. Example:


REAL A(10), B(10)
WHERE (A > 0.0) B = LOG(A)     ! the assignment is evaluated only
                               ! where the elements of A are > 0


WHERE can be regarded as a “vectorized” IF and can
similarly be used as a block statement together with
ELSEWHERE and ENDWHERE.
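
The effect of a masked assignment, including an ELSEWHERE branch, can be mimicked with plain Python lists. This is a sketch under stated assumptions: the sample data and the value -1.0 used in the ELSEWHERE branch are chosen here purely for illustration.

```python
import math

a = [4.0, -1.0, 0.0, 1.0, 9.0]
b = [0.0] * len(a)

mask = [x > 0.0 for x in a]
# WHERE: assign only where the mask is true; other elements of b stay untouched.
b = [math.log(x) if m else old for x, m, old in zip(a, mask, b)]
# ELSEWHERE: a second assignment under the complementary mask.
b = [old if m else -1.0 for m, old in zip(mask, b)]
```

The logarithm is never evaluated for non-positive elements, which is exactly what makes the masked form safe where an unmasked array assignment would fail.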

An array assignment can be specified in terms of array
elements using a FORALL statement. Example:


REAL GRID (0:M, 0:N)
GRID = 0                                      ! set all grid points to 0
FORALL (I=1:M-1, J=1:N-1) GRID(I,J) = 1       ! set interior grid points to 1


Intrinsic functions. Fortran 8x provides new intrinsic
functions related to array processing. Here are some
examples:


DOTPRODUCT (VA,VB)     ! dot product of two 1D vectors
MATMUL (MA,MB)         ! matrix product of two matrices
MAXVAL (A)             ! maximum value of all elements of A
MAXLOC (A)             ! location of maximum element
SUM (A)                ! sum of all elements of A
ALL (MASK)             ! determines whether all elements
                       ! in MASK are true


Some functions can be called with optional DIM and 
MASK arguments. The use of the new intrinsic functions 
is recommended since they are implemented very effi- 
ciently on the Suprenum node either as assembler pro- 
grams or in microcode. 


In order to exploit the vector floating point unit of the
Suprenum node efficiently, special vector instructions
have to be generated. The compiler can generate these
vector instructions only if


1. the vectors are already formulated in the Fortran
source using array notation as described above, or


2. a loop-based code is transformed by a vectorizer.
Since not only new Fortran 8x programs but also
“old” Fortran 77 programs should run on the
Suprenum node, the Suprenum-Fortran vectorizer
has been developed, which works either as an integrated
vectorizer in combination with the Suprenum-Fortran
compiler or as a source-to-source transformer
which generates readable Fortran 8x code.


Miscellaneous Extensions 


Suprenum-Fortran covers most of the Vax and IBM
extensions, such as the DOUBLE COMPLEX, INTEGER*2, and
BYTE data types and additional numerical intrinsic functions
such as COTAN, GAMMA, and ERF. As soon as the BIT
data type is finally defined in the Fortran 8x standard, it
will be supported by Suprenum-Fortran.

Parallel Programming

[Article by Bernhard Thomas and Klaus Peinze: “Suprenum
Comfort of Parallel Programming”]


[Text] Program development for the Suprenum multi- 
processor is based on an abstract machine concept. 
Languages such as Suprenum-Fortran and a rich choice 
of tools support, and considerably facilitate, the writing, 
testing, and analysing of parallel applications in the 
frame of this programming model. 


System architecture and software components of
Suprenum have been described in detail in preceding
contributions [1-3]. But in fact the programmer of a
Suprenum system does not have to be aware of all these
details when developing and running his/her software.
This is one important aspect of the Suprenum software
concept, which is depicted in Figure 1.


Programming for the Abstract Machine 


Application software development for Suprenum can be
based on a very general programming model, the
Abstract Suprenum Machine. The model allows program
design in terms of concurrently executing, communicating
processes and can be mapped, in principle, to a
wide range of MIMD-parallel, distributed memory systems.
Suprenum, in particular, supports this view at both the
hardware and the system software level as well as by dedicated
language extensions.

The Abstract Suprenum Machine comprises the following
concepts:

— an application consists of a dynamical system of
processes that are generated from independent program
units;

— there is one initial process that initiates the distributed
application;

— each process can create other processes at any time;
termination of a process is an internal event, e.g. by
executing a STOP;

— termination of the initial process, however, will end
the distributed application;

— processes have access to private data space only;

— inter-process data requests are handled strictly by
message passing;

— the process system can be of arbitrary structure (with
respect to creation and interprocess communication);

— vector and array processing can be programmed
within processes.

Figure 1. The Suprenum software concept, a layered
structure providing system transparency on various
levels.


Obviously, configuration details such as the cluster
structure or the interconnection system do not explicitly
occur in the abstract machine context. There is, of
course, a natural understanding of how these concepts
might map onto a given hardware: processes
will usually be thought of as executing on individual
processors (or nodes), with the initial process being
located on a front-end processor. Also, vectorized operations
will be assigned to the vector floating-point unit in
a node.


Yet there is no need to be concerned about these aspects
in the course of program development. For example, an
application designed to comprise a certain number of
processes may actually run on a smaller number of
processors (multi-processing on nodes), with lower
performance, of course, but without any implications for the
program development.


In a sense, the granularity of the individual processes
should also be taken into account, which indeed introduces a
hardware aspect into the programming model. But again,
this can be kept quite general, and may function merely
as a guide to the complexity of the tasks to be performed
in a process.


The layout of a parallel application in terms of the
Abstract Machine concept is quite easy. The initial
process is written as a single main program that will
usually take care of creation and initialization of the
process system (e.g. provide parameters and initial data)
as well as of general I/O. All other processes are executable
copies of one or several task programs, which are
written according to the chosen parallelization strategy.
In grid-based applications, for example, the same task
program usually codes all necessary operations to be
performed on a part of the grid (see the grid partitioning
parallelization paradigm in [4]). In other applications
there might be independent algorithmic components
that can be programmed for concurrent execution. Creation
of new processes and message-based data exchange
are programmed whenever needed or wherever suitable
in task programs or the initial program (dynamical
process creation).


Within task programs attention should be paid to any 
portion of code that suits vector processing mode. Here 
the programmer may choose to utilize SIMD-parallelism 
within MIMD-parallelism by writing vector instructions 
explicitly. 


The Abstract Suprenum Machine is a useful concept for
mapping high-level parallelism identified in an intended
application onto an efficient process structure, before
taking pains to code it in a particular programming
language for a particular system. Nevertheless, a comfortable
programming environment and suitable language
primitives are needed to support this model.

Using Suprenum-Fortran

Suprenum provides two programming languages that 
directly provide this support by means of language 
extensions: Suprenum-Fortran and Concurrent Modula- 
2. Focusing on numerical applications, we will mainly 
consider Suprenum-Fortran throughout this contribu- 
tion. 


There are many reasons for choosing Fortran as a 
primary language in scientific computing, and these have 
been given in [3] along with a detailed description of the 
main extensions. Here we concentrate on the MIMD- 
oriented features. 


The example below gives code fragments in Suprenum-Fortran
that realize the following abstract model of a
relaxation algorithm as it might be used in solving
boundary value problems for elliptic partial differential
equations:


Initial process

— get grid dimensions and problem parameters interactively;

— create 2-D array of processes executing the relaxation
program (see node process) and send parameters of
the application (e.g. identifications of neighboring
processes, grid, subgrid and process array extensions)
to each process as well as initial data on the corresponding
subgrid;

— receive solution data from the processes and establish
global results;

— stop.

Node processes

— receive parameters and initial data for local subgrid;

— perform computations, essentially the conventional
relaxation routine;

— after each computational step, update values in
points near inner boundaries by mutual exchange
with neighbor processes;

— retrieve and send out global results (residual norms),
preferably along a tree-like structure;

— send results to host.


The initial process is written as an initial task program
whereas the node processes can be generated from one
task program (called NODEPRG in this example). Note
the ease of using tags (see [3]) to make asynchronous user
data exchange between processes logically safe. Note also
that computational parts are written in common
Fortran 77 throughout and kept distinct from communication
parts. A typical consequence of this strategy is
that large parts of existing Fortran codes, e.g. written as
subroutines, can be reused unaffected by parallelization
requirements.


Host Program:


C     declarations of arrays etc.
C     declaration of TASKID variables and arrays
      TASKID PID(..,..)
      TASKID SOUTH, NORTH, WEST, EAST
C     initialize tags used in SEND's and RECEIVE's
      INTEGER TIN, TST, TSO, TRE
      DATA TIN/10/, TST/11/, TSO/12/, TRE/13/
C     user input:
C     process configuration NPX x NPY, grid size NX x NY
C     initial solution U0, right hand side F
      READ(...) NPX, NPY, NX, NY, U, F
      NP=NPX * NPY
C     compute size of subdomains
      IPSX=NX/NPX
      IPSY=NY/NPY
C     create 2-D array of node processes from task program NODEPRG
      DO 10 IPX = 1, NPX
      DO 10 IPY = 1, NPY
         NEWTASK('NODEPRG', PID(IPX,IPY))
C     compute index boundaries of subgrids
C     and store to index arrays IX, IY
   10 CONTINUE
C     distribution of parameters of the application
      DO 20 IPX = 1, NPX
      DO 20 IPY = 1, NPY
C     identify neighbors of process in IPX, IPY
         SOUTH = .NOTASKID.
         IF(IPY.NE.1) SOUTH=PID(IPX, IPY-1)
         NORTH = .NOTASKID.
         IF(IPY.NE.NPY) NORTH=PID(IPX, IPY+1)
         WEST = .NOTASKID.
         IF(IPX.NE.1) WEST=PID(IPX-1, IPY)
         EAST = .NOTASKID.
         IF(IPX.NE.NPX) EAST=PID(IPX+1, IPY)
C     send parameters and process specific information
C     to process in IPX, IPY
         SEND (TASKID=PID(IPX,IPY), TAG=TIN)
     .        IPX, IPY, IX(IPX,IPY,1:2), IY(IPX,IPY,1:2),
     .        SOUTH, NORTH, WEST, EAST
   20 CONTINUE
C     send initial data and right hand side
      DO 30 IPX = 1, NPX
      DO 30 IPY = 1, NPY
         SEND (TASKID=PID(IPX,IPY), TAG=TST)
     .        U(IX(IPX,IPY,1):IX(IPX,IPY,2),
     .          IY(IPX,IPY,1):IY(IPX,IPY,2)),
     .        F(IX(IPX,IPY,1):IX(IPX,IPY,2),
     .          IY(IPX,IPY,1):IY(IPX,IPY,2))
   30 CONTINUE
C     receive solution in arbitrary order
      IP=0
   40 IF(IP.LT.NP) THEN
         RECEIVE (TAG=TSO) IPX, IPY,
     .           U(IX(IPX,IPY,1):IX(IPX,IPY,2),
     .             IY(IPX,IPY,1):IY(IPX,IPY,2))
         IP = IP+1
         GOTO 40
      ENDIF
C     receive residual norms in arbitrary order
      IP=0
      RES = 0.D0
   50 IF(IP.LT.NP) THEN
         RECEIVE (TAG=TRE) RESLOC
         RES = MAX (RES, RESLOC)
         IP = IP+1
         GOTO 50
      ENDIF
C     postprocessing
C     end of host program
      STOP
      END


Node Program:


C     declarations of arrays, TASKID variables etc.
      TASKID SOUTH, NORTH, WEST, EAST
C     initialize tags
      INTEGER TIN, TST, TSO, TRE
      DATA TIN/10/, TST/11/, TSO/12/, TRE/13/
C     receive parameters and initial information
      RECEIVE (TAG=TIN) IPX, IPY, IX, IY,
     .        SOUTH, NORTH, WEST, EAST
C     receive initial data and right hand side
      RECEIVE (TAG=TST) U(IX(1):IX(2), IY(1):IY(2)),
     .                  F(IX(1):IX(2), IY(1):IY(2))
C     iterative loop (pre-assigned number of passes: MAXIT)
      DO 10 IT = 1, MAXIT
C     subroutine RELAX contains the usual sequential program
         CALL RELAX(U, F, ...)
C     data exchange across inner boundaries by message passing
         TEX = 100 + IT
         SEND (TASKID=WEST,  TAG=TEX) U(IX(1),IY(1):IY(2))
         SEND (TASKID=EAST,  TAG=TEX) U(IX(2),IY(1):IY(2))
         SEND (TASKID=SOUTH, TAG=TEX) U(IX(1):IX(2),IY(1))
         SEND (TASKID=NORTH, TAG=TEX) U(IX(1):IX(2),IY(2))
         RECEIVE (TASKID=WEST,  TAG=TEX) U(IX(1)-1,IY(1):IY(2))
         RECEIVE (TASKID=EAST,  TAG=TEX) U(IX(2)+1,IY(1):IY(2))
         RECEIVE (TASKID=SOUTH, TAG=TEX) U(IX(1):IX(2),IY(1)-1)
         RECEIVE (TASKID=NORTH, TAG=TEX) U(IX(1):IX(2),IY(2)+1)
C     end of iterative loop
   10 CONTINUE
C     send local solution to host
      SEND (TASKID=MASTER(), TAG=TSO) IPX, IPY,
     .     U(IX(1):IX(2), IY(1):IY(2))
C     send local residual norms to host
      CALL RESID(U, F, RES, ...)
      SEND (TASKID=MASTER(), TAG=TRE) RES
C     end of node program
      STOP
      END


Clearly, within the computational parts, SIMD parallelism
can be exploited by using appropriate vector notations as
provided by Suprenum-Fortran. In addition, the
Suprenum-Fortran compiler includes an autovectorizer (see
[3,5]).


Employing Libraries 


As noted in the previous section, collecting and sending
out global data from a grid of processes can be done most
efficiently in a treewise fashion. This implies treating the
collection of processes as a tree structure rather than a
grid. On Suprenum, multi-structured process systems are
supported as a feature of the Abstract Machine model view.


For example, the grid structure laid out in a 2-D TASKID
array in the example program fragment above can easily be
supplied with an additional tree structure by introducing a
father/left-son/right-son scheme of TASKIDs in each process.
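The father/left-son/right-son idea can be sketched in a few lines of Python. The sketch below is only an illustration: the function name `tree_links` and the heap-style numbering of the processes are assumptions made here, not the scheme used by Suprenum or its libraries. It takes the 1-based grid coordinates used in the Fortran fragment and returns the grid coordinates of the tree neighbors.

```python
def tree_links(ipx, ipy, npx, npy):
    """Return (father, left_son, right_son) for grid process (ipx, ipy).

    Coordinates are 1-based, as in the Fortran fragment. A missing link
    is reported as None. The heap-style numbering (sons of rank r are
    2r+1 and 2r+2) is an assumption of this sketch.
    """
    def to_rank(ix, iy):
        # linearize the 2-D process grid row by row, starting at 0
        return (ix - 1) * npy + (iy - 1)

    def to_coords(rank):
        return (rank // npy + 1, rank % npy + 1)

    nproc = npx * npy
    rank = to_rank(ipx, ipy)
    father = to_coords((rank - 1) // 2) if rank > 0 else None
    left = to_coords(2 * rank + 1) if 2 * rank + 1 < nproc else None
    right = to_coords(2 * rank + 2) if 2 * rank + 2 < nproc else None
    return father, left, right


# Process (1,1) is the root of the tree laid over a 2 x 3 process grid.
print(tree_links(1, 1, 2, 3))
```

Each process would store these three links next to its SOUTH, NORTH, WEST and EAST neighbors, so that the same process set can be addressed either as a grid or as a tree.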


Structuring process systems and distributing process
structures across the available processors is particularly
facilitated by the Mapping Library. The library can be
invoked to establish:

— for a particular process set, one of the elementary
topologies (trees, rings, cubes, etc.) or a general graph
topology specified by a (weighted) adjacency matrix;

— automatic process placement, where a new process is
loaded onto a new processor taking current workload
into account;

— semi-automatic process placement, where the
programmer may specify a new process to be placed onto
the same processor or same cluster or elsewhere.


Whereas the first two items are automatic mappings, the
third one, along with explicit process placement, enables
the user to control directly where a process has to go
([6]). Programming can thus use various levels of
transparency with respect to the underlying hardware,
according to the special requirements of the programmer.


Thus, if the programmer does not want to take care of
process structuring and placement, he might resort to
mapping library routine calls. If he prefers not to worry
about process creation and communication at all, he
might even write programs completely in Fortran 77
(possibly including Fortran 8x notations provided by
Suprenum-Fortran) by relying on Communication
Library calls.


For the current version of the communication library,
this is only meant to work for problems where grid
partitioning is the parallelization paradigm. Such problems,
however, are numerous and occur in various fields of
application (see [4]). The table below lists example routines
that correspond to whole parts of the relaxation program
above.


Initial Task   Task             Comment

crgr2d         grid2d           creation of 2-D grid of processes
                                generated from specified task program,
                                also generates logical tree structure;
                                transfer of initial information

               supdt2, rupdt2   send/receive values in boundary area of
                                specified width to specified neighbors
                                (2-D problems)

gloph          glops            compute global values from local ones
                                according to specified arithmetic
                                operation; distribute and receive
                                results (initial task: receive only)

               agglm2           agglomeration for 2-D process grids:
                                maps logical process grid onto another
                                one of different size


Essentially, 2-D and 3-D process grid and tree structures
are created simply by subroutine calls. It is worth
mentioning that all the necessary process identification
management (by data of type TASKID) is completely hidden
in the subprograms of the communication library.
Communication across inner boundaries is done by calling
boundary data exchange routines, where the depth of the
boundary layer as well as an ordering of points can be
respected. Several other services are accessible according
to needs, including collecting and sending out global
values tree-wise, and agglomeration, a strategy applied in
multi-processor implementations of multigrid methods.
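To illustrate why collecting global values tree-wise pays off, the following Python sketch combines one local value per process along a binary tree and counts the communication rounds. It is not the interface of the library routines named in the table; the function name `tree_reduce` and the heap-style numbering of the processes are assumptions made for this example.

```python
def tree_reduce(values, op=max):
    """Combine one value per process along a binary tree.

    values[r] is the local value of process r (at least one process is
    required); the father of process r > 0 is (r - 1) // 2. Returns the
    combined value at the root and the number of communication rounds.
    """
    partial = list(values)
    n = len(partial)
    # depth of every process in the tree (root has depth 0)
    depth = [0] * n
    for r in range(1, n):
        depth[r] = depth[(r - 1) // 2] + 1
    rounds = 0
    # work from the leaves towards the root, one tree level per round;
    # all sends within one level could happen at the same time
    for level in range(max(depth), 0, -1):
        for r in range(n):
            if depth[r] == level:
                father = (r - 1) // 2
                partial[father] = op(partial[father], partial[r])
        rounds += 1
    return partial[0], rounds


# Global maximum of five local residual norms.
print(tree_reduce([0.3, 0.1, 0.7, 0.2, 0.5]))
```

For 16 processes the tree needs 4 rounds, whereas a host that receives every local value itself, as in the example host program, has to handle 16 messages one after the other.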


Besides the aspect of merely programming in plain
Fortran, there are several general advantages of
employing the communication library in grid-oriented
numerical software:

— programming is safer using well tested and optimized
routines for data exchange across process grids;

— redundant coding is considerably reduced;

— software becomes portable for a wide class of
multi-processor systems, since only the library has to be
adapted, which can be done safely, leaving the user
program essentially unaltered. To date, implementations
have been done on iPSC, Ncube, and even a shared
memory system ([7]).


Writing Programs 


Program editing can be done on the front-end system or,
equally well, on remote machines like Unix workstations.
In any case, the programming environment is
Unix-based [2] and thus familiar to most numerical
programmers.


Besides, there is a comfortable, language-dependent
programming system which allows maintaining sources,
keeping version control, facilitating code writing, and
doing syntax checks even at the program fragment level.
The programming system is operated through a
window-based look-and-feel type surface and makes
programming much easier and safer. Details are given in [8].


With a set of files constituting the parallel program (e.g.
an initial task program relaxh.f and associated task
programs, e.g. relaxn.f), ways may now temporarily
diverge: the programmer might either hope for a
fault-free, well-performing program and push it forward onto
the Suprenum multiprocessor (see below), or he might
take the more careful step of using the Suprenum simulator
instead, which runs on the front-end or a Unix workstation.


Simulations With SUSI 


SUSI, the Suprenum simulating system, is an implementation
of the Abstract Suprenum Machine and can be
used to simulate a parallel application on a conventional
architecture. SUSI consists of language-dependent
preprocessors, a hardware configurator, a user interface,
runtime systems for both languages (Suprenum-Fortran
and Concurrent Modula-2), and the actual scheduling
and execution kernel.


The preprocessor is invoked for the Suprenum-Fortran 
example by: 


mf77 [mf77-options] [f77-options] [ld-options] relaxh.f relaxn.f


where the list of program files might be preceded by the 
common f77 and linker options as well as by special 
preprocessor options which include a reference to the 
mapping library. mf77 resolves MIMD-constructs in 
Suprenum-Fortran into subroutine calls, compiles them, 
and links them with SUSI’s runtime system. The simu- 
lator kernel takes care of scheduling processes and object 
dispatching based on coroutine concepts from Modula- 
2. Another version of the simulator currently under 
development is based on an emulation of the basic 
distributed operating system PEACE [1]. 


The SUSI user interface provides flexible and comprehensive
control of the simulation run, and is able to
extract all sorts of statistics and trace data on activities of
objects such as user processes, CPUs, busses, mailboxes,
etc. To begin with, the hardware configurator
expects input of a hardware configuration, to which the
application will be mapped. This is usually contained in
a hardware description (HADES) file, which might be
specified as in the following example, or be generated by
a graphical tool (HADESGEN).

TYPE
  Cpu1 = CPU MIPS = 2.0; TSLICE = 0.05 END (* cpu *);
  Bus1 = BUS MBITS = 30.0; END (* bus *);
  Bus2 = BUS MBITS = 300.0; END (* bus *);
  Icb1 = ICB MBITS = 24.0; END (* inter-cluster-bus *);
  Icb2 = ICB MBITS = 512.0; END (* inter-cluster-bus *);
  Clu1 = CLUSTER Cpuj [8] : Cpu1;
                 Busj : Bus1; END (* cluster *);
  Clu2 = CLUSTER Cpuk [8] : Cpu1;
                 Busk : Bus2; END (* cluster *);
  H1 = HOST Cpul [1] : Cpu1;
            Busl : Bus1; END (* host *);
  H2 = HOST Cpum [1] : Cpu1;
            Busm : Bus2; END (* host *);
  SYS1 = LATTICE (* system configuration *)
    #    #  Icb1 Icb1 ;
    #    #  H1   #    ;
    Icb1 #  Clu1 Clu1 ;
    Icb1 #  Clu1 Clu1 ;
  END (* lattice *);
  SYS2 = LATTICE (* another system configuration *)
    #    #  Icb2 Icb2 ;
    #    #  H2   #    ;
    Icb2 #  Clu2 Clu2 ;
    Icb2 #  Clu2 Clu2 ;
  END (* lattice *);
END.


Systems like SYS1 and SYS2 are “plugged together”
from clusters, intercluster busses, CPUs and cluster
busses to taste.


After starting the simulation run, SUSI's interface
prompt will show up. The simulation may then be run
for a specified period of simulation time by issuing
commands like:


start simtime=0.2


It may be interrupted by CTL-C or after simtime has
elapsed, and then continued, possibly with new options
set, or stopped to terminate the simulation. Whenever
appropriate, a variety of trace information on all kinds of
events can be switched on or off. As an example, by
giving


trace on msg all (up=system)


listings like the one below will be output from SUSI:


0.003202 UP001CPU000: multicast No= 9 sent: PAT/TAG=1
0.003230 CPU001: multicast No= 9 deliv. to UP002CPU001
         OrigNo= 9 PAT/TAG=1
0.003230 CPU002: multicast No= 13 deliv. to UP003CPU002
         OrigNo= 9 PAT/TAG=1
0.003230 CPU003: multicast No= 14 deliv. to UP004CPU003
         OrigNo= 9 PAT/TAG=1
0.003230 CPU004: multicast No= 15 deliv. to UP005CPU004
         OrigNo= 9 PAT/TAG=1
0.003340 UP002CPU001: multicast No= 9 recv from UP001CPU000
         OrigNo= 9 PAT/TAG=1
0.003340 UP003CPU002: multicast No= 13 recv from UP001CPU000
         OrigNo= 9 PAT/TAG=1
0.003340 UP004CPU003: multicast No= 14 recv from UP001CPU000
         OrigNo= 9 PAT/TAG=1
0.003340 UP005CPU004: multicast No= 15 recv from UP001CPU000
         OrigNo= 9 PAT/TAG=1
0.003522 UP001CPU000: userdata No= 41 sent to UP002CPU001
         PAT/TAG=2
0.003609 CPU001: userdata No= 41 deliv. to UP002CPU001
         PAT/TAG=2
0.003719 UP002CPU001: userdata No= 41 recv from UP001CPU000
         PAT/TAG=2
0.003837 UP001CPU000: userdata No= 42 sent to UP003CPU002
         PAT/TAG=2
0.003924 CPU002: userdata No= 42 deliv. to UP003CPU002
         PAT/TAG=2
0.004034 UP003CPU002: userdata No= 42 recv from UP001CPU000
         PAT/TAG=2


The complete range of commands and options exceeds
the scope of this contribution; for further reference see
[9,10]. Graphic tools (as described below) can be used to
process trace data and facilitate the understanding of
what the parallel application actually does as a process
system.


SUSI is employed to do logical testing of the complete 
parallel application, to get a feeling for loads on CPUs, 
busses, etc., for the communication paths, inappropriate 
message scheduling, and deadlocks. It provides some 
rough estimates on resource utilization and efficiency. 


If it is felt that the program is perfect, with respect to 
SUSI, one wants to see real performance by running it on 
the actual hardware. In most applications that promise 
good parallel efficiencies on theoretical grounds, the 
problem will be scalable. That is, to do real life calcula- 
tions the user will only have to change some (interac- 
tively set) parameters that describe the size of the appli- 
cation, e.g. the total grid size and the size of the subgrids 
within one process. There will be no rewriting of any part 
of the code to switch from SUSI to Suprenum hardware 
in such cases. 


Running Programs on a Real Configuration 


The Suprenum user interface (see [2]) provides the
necessary commands and services to let a user run his
application on real hardware without having to bother
in depth about multi-processor specific details such as
resource allocation, job control, downloading and the like.
Since the Suprenum system is perceived by the
user as a homogeneous Unix system with the Suprenum
kernel hardware appearing as a specific device, conventions
for e.g. file access, job execution, and querying the
status of the system are quite familiar. In particular,
details about the kernel operating system (PEACE, see
[11]) are not visible on the user interface level.


To run a program that has been compiled and linked by 
the Suprenum-Fortran compiler (see [3]), the job man- 
ager can be invoked by the Suprenum kernel execute 
command skx as simply as e.g. 


skx -n 4 -t 20:00 relax 


Here, relax is assumed to be the executable file generated
from relaxh.f and relaxn.f in the above example. Option
-n requests the 4 clusters needed for the run, and -t gives
the time limit. A detailed discussion is given in [2].


Keeping Informed 


Besides displaying the actual user program output, which
may make use of GKS and X/Windows based tools on
the front-end, there are several facilities that provide
information about what is happening on the allocated
part of the machine. Among these are:

— a parallel debugger;

— performance analysis and profiling;

— state reporting,


which can be invoked together with the execution command.


The state reporting facility, in particular, will extract
data on the process and message passing activities of a
distributed application running on Suprenum hardware,
similar to the trace data generated by SUSI's trace
options.


A set of visualization tools operate on such data streams 
to produce a comprehensive graphical representation of 
the overall behaviour of the application: the dynamic 
map, the time map, and the statistic map. 


All three of them take a stream of standardized event
descriptions as input. A filter program extracts relevant
information either from SUSI traces or from the state
reporting facility and condenses it into this standard
event format. What information is considered to be
relevant, as well as hints on where to find it in a line of
information, can be specified by the user in a filter
description file that acts as an interface between the
various sources of process data and the filter program.
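The role of the filter can be sketched as follows in Python. This is only an illustration: the line layout is inferred from the SUSI trace listing shown earlier, and the regular expression, the function name `filter_trace` and the dictionary keys are assumptions of this sketch, not the standard event format or the filter description syntax of the Suprenum tools.

```python
import re

# One event per header line: time stamp, reporting object, message kind,
# message number and what happened to the message.
EVENT = re.compile(
    r"^(?P<time>\d+\.\d+)\s+"
    r"(?P<source>\w+):\s+"
    r"(?P<kind>multicast|userdata)\s+"
    r"No=\s*(?P<number>\d+)\s+"
    r"(?P<action>sent|deliv\.|recv)"
)


def filter_trace(lines):
    """Condense trace lines into a list of event dictionaries."""
    events = []
    for line in lines:
        match = EVENT.match(line.strip())
        if match is None:
            # continuation lines such as "PAT/TAG=2" carry no new event
            continue
        events.append({
            "time": float(match.group("time")),
            "source": match.group("source"),
            "kind": match.group("kind"),
            "number": int(match.group("number")),
            "action": match.group("action").rstrip("."),
        })
    return events


trace = [
    "0.003522 UP001CPU000: userdata No= 41 sent to UP002CPU001",
    "         PAT/TAG=2",
    "0.003609 CPU001: userdata No= 41 deliv. to UP002CPU001",
    "         PAT/TAG=2",
]
for event in filter_trace(trace):
    print(event)
```

In the real system the choice of relevant fields is not hard-wired as it is here but is taken from the filter description file.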


The dynamic map produces an animated view of the
activities of processes, processors, mailboxes, and of the
interconnection system, along with information on
sent-out or incoming messages. The animation can be slowed
down to follow through the actions of the system, e.g.
right before a deadlock situation.


The time map produces a comprehensive temporal over- 
view of the behavior of the complete application. It 
displays a record of states of processes and associated 
mailboxes as well as of process creation and intercom- 
munication over the full time-interval. The graphics 
provide a “first glance” of the communication schedule 
and active-to-inactive time ratio of processes. Analyzing 
the display in more depth can provide useful hints to 
inefficient organization of the communication and lead 
to improved tuning of the application. 


The statistic map generates a final evaluation of the
reported data and displays some statistics on the
performance of a distributed application. It computes figures,
and displays them graphically, for e.g. the load of
processors, the number of interprocess communication
events, the occupation of mailboxes, as well as the time
spent on message passing activities, in much detail.


An overview of the complete visualization system is
given in [12]. The visualization can be run concurrently
(in the sense of Unix pipelined processes) with the
application by establishing a suitable pipeline, including
the filter. However, it usually does not make much sense
to monitor the behavior of an application with these tools
on-line, unless it is run as a simulation. Instead, reported
data will be written to file, either on cluster disks and/or
on external mass memory, for further postprocessing by
the visualization tools. Hard copies of the visualization
can be made most easily with a suitable color graphics
printer.













References 


1. Giloi, W., Suprenum—the system, Supercomputer,
this issue.


2. Werner, K., The Suprenum user interface, Supercomputer,
this issue.


3. Solchenbach, K., Suprenum-Fortran—an MIMD/SIMD
language, Supercomputer, this issue.


4. Solchenbach, K., Application software for Suprenum,
Supercomputer, this issue.


5. Trottenberg, U., Suprenum—the concept, Supercomputer,
this issue.


6. Kramer, O., Suprenum mapping library—user
manual, GMD, St. Augustin, 1987.


7. Hempel, R., The Suprenum communications subroutine
library for grid-oriented problems, Mathematics and
Computer Science Division, Argonne National Laboratory,
Argonne, IL, 1987.


8. Thies, Ch., Anleitung zum Benutzen von
PSG-Programmierumgebungen, Report PI-R9/87, TH Darmstadt,
FB 20, 1987.


9. Limburger, F., Ch. Scheidler, Ch. Tietz and A. Wessels,
Benutzeranleitung des Suprenum-Simulationssystems
SUSI, GMD, St. Augustin, 1986.


10. Tietz, Ch., Das Benutzer-Interface des
Suprenum-Simulationssystems, GMD, St. Augustin, 1987.


11. Bast, H.-J., M. Gerndt and C.-A. Thole, SUPERB—The
Suprenum Parallelizer, Supercomputer, this issue.


12. Thomas, B., E. Thomas and E. Truchet, Suprenum
visualization tools for distributed applications—user’s
guide, Suprenum Report 12, Bonn, 1988.


Application Software 
36980245 Amsterdam SUPERCOMPUTER in English
Mar 89 pp 44-50


[Article by Karl Solchenbach: “Application Software for 
Suprenum”’] 


[Text] About one third of the SUPRENUM development
resources has been spent on the implementation of
application software packages, mainly from scientific
computing, such as CFD and statistical physics. Some of the
codes are based on sequential versions which have been
“parallelized”; others have been written completely anew.
The paper gives an overview of the SUPRENUM
application codes and briefly sketches the underlying
parallelization techniques.


The availability of practically relevant application
software is decisive for the scientific and commercial success
of a new computer architecture like Suprenum. This
software must make use of the specific advantages of the
architecture and translate these advantages into gains of
speed. It is, however, not desirable to use old numerical
methods on advanced computer architectures. Only
efficient numerical algorithms in combination with the
parallel Suprenum hardware can provide the computing
performance which is required for large scale scientific
and technical simulations. Consequently, roughly one
third of the Suprenum development resources have been
spent on application software development, which
covers the implementation of new algorithms as well as
the parallelization of existing codes.


The application software for Suprenum is based on the
Abstract Suprenum Architecture (see [1]), i.e., the
parallel programs are formulated in terms of parallel
processes. The number of processes and their topology
(defined by the message passing communication) are
primarily prescribed by the numerical problem, and they
are independent of the actual hardware configuration.


The programming language for nearly all of the
application packages is Suprenum-Fortran (see [2]). Before the
hardware was available, the application software
development was done on the Suprenum simulator.


Application Software Packages 
Linear Algebra Package 


Suprenum provides parallel algorithms for linear algebra
computations including:

— vector and matrix operations;

— elimination methods for linear systems with dense
matrices (Gauss, Cholesky);

— elimination methods for linear systems with banded
matrices (reduction methods);

— solution of eigenvalue problems;

— iterative solvers for sparse systems (conjugate
gradients, incomplete decompositions, ADI, block
relaxation).


The interface to the dense matrix solvers is an extension
of the LINPACK interface.


Multigrid Software 


A library of multigrid solvers for elliptic boundary value
problems is available on Suprenum. The partial
differential equations are of the type


$\nabla \cdot (D \nabla u) + c u = f$


Different classes of coefficients and boundary conditions
can be selected. The domain is a 2-D or 3-D cube. The
solvers are based on highly efficient parallel multigrid
algorithms. Due to their high degree of parallelism they
can be parallelized (across the Suprenum nodes) and can
be vectorized (within each node).


Computational Fluid Dynamics (CFD)


Potential solver. Because of its comparatively small
computational work, the potential equation is still an
attractive model used very frequently for aerodynamical
applications where only limited accuracy is required. It can be
applied for subsonic as well as for transonic flow. It is,
however, restricted in principle to irrotational and
inviscid flow.


Its mathematical formulation is a nonlinear scalar PDE
which is elliptic in the subsonic flow areas and hyperbolic
in the transonic areas. The problem is solved
numerically by a parallel multigrid code which provides
a special treatment of the sonic shock curve.


Euler solver. The Euler equations are the standard model
for the description of inviscid flows. They are used for all
simulations in aerodynamics where the viscosity can be
neglected. Mathematically they form a coupled nonlinear
system of PDEs for the flow-velocity components,
the pressure, and the total energy.


Often it is necessary to take viscous effects into account
near boundaries whereas they are negligible in zones far
away from the body surface. This leads to a combination
of the Euler equations with special boundary layer
approximations.


Navier-Stokes solver. The most general CFD models are
the (compressible) Navier-Stokes equations, which in
principle describe all flow phenomena governed by
macroscopic physical rules. Mathematically they form a system
of PDEs similar to that of the Euler equations with
additional viscous terms.


Suprenum offers the parallelized version of the established
code Ikarus (by Dornier) for the compressible case
and, for incompressible calculations, the completely new
Navier-Stokes solver Liss (by GMD), which is based on
multigrid methods and designed for Suprenum.


Grid generation. Most interesting flow phenomena are
related to geometrically complex domains with curved
boundaries. Typical examples are the exterior space
around wings, airplanes or cars, or the interior space in
pipes, turbines, etc. In order to discretize the
mathematical model (i.e., the system of PDEs) by finite differences
or finite volumes, one needs a grid with the following
properties:

— good resolution (especially near boundaries);

— simple and accurate discretization of boundary
conditions;

— logically simple structure.


Suprenum offers 2-D and 3-D grid generators for
boundary fitted grids (based on Thompson’s method [3])
with graphical interfaces.


Other Applications 


Besides the CFD applications, many different application
software packages have been adapted to Suprenum. Here
we mention some of them:


• Structural analysis. A finite element code (PERMAS)
is currently being adapted for Suprenum. The sequential
linear solver (Cholesky) is replaced by a parallel
version.

• Quantum chromodynamics (QCD). These very
time-consuming simulations can optimally be mapped on
Suprenum. An SU(2) code is already running; the
SU(3) code is under development and expected to run
at nearly 2 Gflop/s.

• Reactor safety. In the framework of the Suprenum
project, a thermohydraulic code for the simulation of
a nuclear reactor core is being developed. Another ongoing
activity in this area is the parallelization of the Relap5
code, which simulates the cooling circuit within a
nuclear power plant.

Parallelization 

Many of the different applications mentioned in the
previous section are based on either matrix-vector or
grid data structures. This is not surprising since the
underlying mathematical model consists of PDEs and
their discretization most naturally leads to grid
structures or at least to large matrices.


Parallelization requires the selection of parallel
algorithms and the distribution of the data structure to the
local memory units. The data distribution should try to
preserve locality and to achieve load balancing.

Matrix Based Applications 

The basic data structures for linear algebra calculations
are matrices and vectors. Depending on the particular
algorithm, matrices are distributed in rows, in columns
or, as the most general method for dense matrices, in
blocks (submatrices). The distribution is chosen
according to the following requirements:

— minimal number of communication steps;

— minimal length of communicated data;

— maximal vector length in each process.


Vectors are distributed conforming to the matrix
distribution. Complete redistributions (matrix transform)
should be avoided whenever possible.


Parallel algorithms. The linear algebra algorithms can be
parallelized on block level, i.e., submatrices are treated
independently and simultaneously. In the case of matrix
multiplication this can be done in a straightforward way.
In the case of Gaussian elimination the dependency on
pivot elements has to be considered. Within each process
the algorithms are based on vectorized BLAS routines.
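As an illustration of block-level parallelism for the straightforward case, the Python sketch below multiplies two square matrices block by block. It is a sequential stand-in written for this text, not code from the Suprenum linear algebra package: the function name `block_matmul` and the restriction to square matrices are assumptions, and the innermost loops take the place of the vectorized BLAS kernels mentioned above.

```python
def block_matmul(a, b, block):
    """Multiply square matrices a and b (lists of rows) block by block.

    Every (bi, bj) block of the result depends only on one block row of
    a and one block column of b, so different result blocks could be
    computed by different processes at the same time.
    """
    n = len(a)
    c = [[0.0] * n for _ in range(n)]
    for bi in range(0, n, block):            # block row of the result
        for bj in range(0, n, block):        # block column of the result
            for bk in range(0, n, block):    # accumulate block products
                for i in range(bi, min(bi + block, n)):
                    for j in range(bj, min(bj + block, n)):
                        s = 0.0
                        for k in range(bk, min(bk + block, n)):
                            s += a[i][k] * b[k][j]
                        c[i][j] += s
    return c


print(block_matmul([[1, 2], [3, 4]], [[5, 6], [7, 8]], block=1))
```

The two outer loops are the ones that would be distributed across processes; Gaussian elimination does not decompose as freely because every elimination step has to wait for its pivot.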

Grid Based Applications 

The Suprenum application packages support two classes
of grid structures:

— Regular grids are characterized by direct addressing
of the grid points and a rectangular or cuboid address
space. Geometrical neighbors are also logical
neighbors.

— Block-structured grids are composed of several
regular grids. Each single block internally shows a
regular grid structure; the block structure itself,
however, is irregular (with certain restrictions).


In the future, codes based on irregular grids (as used by
finite element methods) and locally refined grids will also
be implemented on Suprenum.


Parallel grid algorithms. A grid algorithm is a (usually
iterative) method which calculates the value of a grid
function at one point as a function of values defined at
neighboring points (also called relaxation). The iteration
can be characterized as Jacobi-type (the new iterate at a
grid point is calculated using only old neighboring
values) or Gauss-Seidel-type (using already calculated
new neighboring values). Obviously, Jacobi-type
methods are completely parallel since the calculation at
each grid point can be performed independently (see
Figure 1a). If the number of grid points is N, the
parallelism is also N.


The parallelism of Gauss-Seidel methods depends on the
order in which the grid points are processed.
Lexicographic ordering implies that only points on “wave
fronts” can be calculated in parallel (see Figure 1b).


For Gauss-Seidel methods, a far better degree of
parallelism, namely N/2, is obtained by “coloring” the grid
points appropriately and processing all points of the
same color simultaneously, e.g. the so-called red-black
relaxation (see Figure 1c).

Figure 1. Jacobi- and Gauss-Seidel relaxation schemes.
The symbols in the figure distinguish grid points which can
be calculated independently in parallel, grid points with
old values, and grid points with already calculated new
values.
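The red-black idea can be made concrete with a small Python sketch. It assumes the standard 5-point discretization of a Poisson-type model problem on a square grid with spacing h, which is a choice made for this example and not a statement about the Suprenum solvers; the function names are likewise hypothetical.

```python
def relax_color(u, f, h, color):
    """Update, in place, all interior points with (i + j) % 2 == color.

    The four neighbors of such a point all have the other color, so the
    updates of one color do not depend on each other and could be done
    in any order, or all at once.
    """
    n = len(u)
    for i in range(1, n - 1):
        for j in range(1, n - 1):
            if (i + j) % 2 == color:
                u[i][j] = 0.25 * (u[i - 1][j] + u[i + 1][j]
                                  + u[i][j - 1] + u[i][j + 1]
                                  + h * h * f[i][j])


def red_black_sweep(u, f, h):
    """One Gauss-Seidel sweep in red-black ordering."""
    relax_color(u, f, h, 0)   # "red" half-sweep uses old black values
    relax_color(u, f, h, 1)   # "black" half-sweep uses new red values


n = 6
h = 1.0 / (n - 1)
u = [[0.0] * n for _ in range(n)]
f = [[1.0] * n for _ in range(n)]
red_black_sweep(u, f, h)
```

Each half-sweep touches about half of the grid points and has no internal dependencies, which is where the degree of parallelism N/2 comes from.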


Although the range of grid algorithms for CFD
applications varies widely, they can all be regarded either as
Jacobi-like (explicit time-stepping schemes belong to
these), as Gauss-Seidel-like, or as a mixture of both.
In all cases the parallelism of the grid algorithms is
sufficiently high for highly parallel systems.


The same applies if, instead of point relaxation schemes
(as described above), line or plane block relaxations are
performed, where the values of a whole line or plane of
grid points are updated simultaneously. The
implementation of parallel grid applications on Suprenum is based
on the method of grid partitioning (often called domain
decomposition).
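A minimal Python sketch of grid partitioning is given below. It makes several simplifying assumptions that are not taken from the Suprenum codes: the grid is cut into horizontal strips rather than 2-D blocks, the relaxation is a plain Jacobi step for the 5-point Laplace stencil, and the "message passing" is modelled by copying the overlap rows. The function names are hypothetical.

```python
def jacobi_step(u):
    """One Jacobi step for the 5-point Laplace stencil.

    The first and last row and column are treated as fixed boundary
    values; a new grid is returned.
    """
    n, m = len(u), len(u[0])
    new = [row[:] for row in u]
    for i in range(1, n - 1):
        for j in range(1, m - 1):
            new[i][j] = 0.25 * (u[i - 1][j] + u[i + 1][j]
                                + u[i][j - 1] + u[i][j + 1])
    return new


def partitioned_jacobi_step(u, nparts):
    """The same step, computed strip by strip.

    Each of the nparts "processes" owns a strip of interior rows and
    sees, in addition, one overlap row on either side, which stands for
    the data received from its neighbors (or the physical boundary).
    """
    n = len(u)
    interior = list(range(1, n - 1))
    size = -(-len(interior) // nparts)       # rows per strip, rounded up
    new = [row[:] for row in u]
    for p in range(nparts):
        rows = interior[p * size:(p + 1) * size]
        if not rows:
            continue                          # more strips than rows
        lo, hi = rows[0], rows[-1]
        # local copy: own rows plus the overlap rows lo-1 and hi+1
        local = [u[i][:] for i in range(lo - 1, hi + 2)]
        local_new = jacobi_step(local)
        for k, i in enumerate(rows):
            new[i] = local_new[k + 1]
    return new


u = [[float((i * i + 3 * j) % 5) for j in range(5)] for i in range(7)]
print(partitioned_jacobi_step(u, 3) == jacobi_step(u))
```

Because every strip only needs one row from each neighbor per step, the amount of data exchanged grows with the strip boundary while the work grows with the strip area.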


Multigrid methods. Standard iterative multigrid
algorithms process a cycle from the fine to the coarse grids
and back to the fine grids sequentially, whereas on each
grid level the actual problem is treated in parallel,
similarly to the parallel single grid algorithms.


The algorithmic and technical details of parallel
multigrid algorithms are described in [4,5].


Communications library. For grid applications, the
explicit programming of the communication can be
hidden from the user. In the Suprenum project, for
example, a library of communication routines has been
developed [6] which ensures:

— clean and error-free programming;

— easy development of parallel codes;

— portability within the class of distributed memory
computers: programs can be ported to any of these
machines as soon as the communication library has
been implemented.


The library supports regular and block-structured grids
and is used by most of the Suprenum applications.


Performance 


The quantities of interest in evaluating the performance
of parallel algorithms are:

— time T(N,P): time to solve a problem of size N on a
multiprocessor system using P nodes;

— speed-up S(N,P) := T(N,1)/T(N,P);

— efficiency E(N,P) := S(N,P)/P.
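The definitions translate directly into code. In the Python sketch below the function names and the timings are made up for illustration; they are not measurements from Suprenum.

```python
def speedup(t_one, t_p):
    """S(N,P) = T(N,1) / T(N,P) for fixed problem size N."""
    return t_one / t_p


def efficiency(t_one, t_p, p):
    """E(N,P) = S(N,P) / P."""
    return speedup(t_one, t_p) / p


# Hypothetical timings: 100 s on one node, 8 s on 16 nodes.
print(speedup(100.0, 8.0))         # 12.5
print(efficiency(100.0, 8.0, 16))  # 0.78125
```

Read the other way round, a reported efficiency E on P nodes corresponds to a speed-up of E times P.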


Note that on Suprenum the utilization of the hardware
capabilities is the product of the “multiprocessor”
efficiency as defined above and the efficiency related to
the vector processing unit. The total problem solving
time, which is the only interesting number from the
user's point of view, depends, of course, additionally on
the numerical efficiency of an algorithm.


In practice E will be smaller than its ideal value 1, mainly
because of communication (including synchronization),
unbalanced load, and sequential parts in the algorithm
(Amdahl’s law). It is often claimed that the speed-up on
parallel systems is limited due to Amdahl’s law. This
does, however, only apply if a problem of constant size is
distributed to more and more processors. Realistically,
the applications are scalable, i.e., the parallel fraction of
the program increases as the problem size increases.


Since the sequential part of such a program is less
dependent on the problem size, its fraction is not
constant and the assumptions of the classical form of
Amdahl’s law are not fulfilled (see [7]).
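The difference between the two views can be shown with a small Python sketch. Both functions are simplified models written for this text: the first is the classical fixed-size form of Amdahl's law, the second assumes that the sequential time stays constant while every node keeps the same amount of parallel work as nodes are added, a view commonly associated with Gustafson's scaled speed-up. The names and the numbers are illustrative only.

```python
def amdahl_speedup(serial_fraction, p):
    """Fixed problem size: the serial fraction bounds the speed-up."""
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / p)


def scaled_speedup(serial_time, parallel_time_per_node, p):
    """Problem size grows with p: each node keeps the same parallel work.

    The scaled problem would take serial_time + p * parallel_time_per_node
    on one node and serial_time + parallel_time_per_node on p nodes.
    """
    t_one = serial_time + parallel_time_per_node * p
    t_p = serial_time + parallel_time_per_node
    return t_one / t_p


# 5 percent sequential part, 256 nodes.
print(amdahl_speedup(0.05, 256))        # stays below 1 / 0.05 = 20
print(scaled_speedup(0.05, 0.95, 256))  # about 243
```

With the same 5 percent sequential share on the small problem, the fixed-size model saturates below 20, while the scaled model stays close to the number of nodes.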


Performance estimates. A simple analysis shows that
asymptotically for matrix and grid based applications:


S(N,P) → P, E(N,P) → 1


if P is fixed and N → infinity. For many grid and matrix
algorithms E depends mainly on N/P, i.e. the size of the
submatrices or subgrids.


Estimated performance results for CFD applications are
given in the next two tables.


3-D potential solver (parallel version of FLO22) with N
approximately equal to 200,000 grid points:


                P = 16    P = 64    P = 256
E(200,000,P)    0.98      0.97      0.85
Mflop/s         75        300       1040


The table shows that for realistic CFD problems a
performance of more than 1 Gflop/s can be expected on
Suprenum.


2-D incompressible Navier-Stokes solver:


            N = 16384    N = 65536      N = 262144
E(N,256)    0.62         [illegible]    [illegible]


As predicted by the asymptotic analysis, the efficiency ts 
increasing with growing problem sizes. 
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[Article by Heinz-J. Bast, Michael Gerndt, and Clemens- 
A. Thole: “SUPERB—the Suprenum Parallelizer”’} 


[Text] Although automatic vectorization is a well-known
technique, automatic transformation of sequential programs
for MIMD execution on distributed memory architectures,
like Suprenum, is a research topic. The real problem is not
the detection of MIMD parallelism but the detection of
locality in the memory references. From the application
point of view two basic kinds of locality for memory
references are distinguished: matrix-type and grid-type.
The basic task for automatic transformation tools for
distributed memory architectures and computers with memory
hierarchies is outlined. The interactive system SUPERB is
oriented to the parallelization of grid-type problems. The
design of this system for semi-automatic transformation of
Fortran 77 programs into parallel programs for the Suprenum
machine is given. The system is characterized by a powerful
analysis component, a catalog of MIMD and SIMD
transformations, and a flexible dialog facility.


The Challenge of Automatic Parallelization 


Parallel programs for parallel architectures can be cre- 
ated by explicit formulation of the parallelism using 
special language constructs (e.g. Suprenum-Fortran) or
special language semantics (e.g. functional languages). 
The application programmer would prefer the automatic 
detection of parallelism in sequential programs written 
in Standard Fortran. 


In the case of SIMD parallelism as it is used by conventional
vector computers the generation of vector instructions
from loops is well known. Comparison of the
automatic vectorizing compilers of different vendors 
shows the very high quality of their products [1]. 


For some architectures compilers supporting MIMD 
parallelism can be used (e.g. Alliant FX/xx, Convex 2xx, 
Cray-2 and Cray X/Y-MP systems). All of these archi- 
tectures are shared memory computers. The automatic 
parallelization uses different levels of nested loops or 
strip mining to generate vector instructions to be executed
in parallel. The work is assigned to processors
either in portions of fixed size or dynamically in several
smaller portions to improve the balance of the loads
[2-4]. In [5,6] it is shown that most of the parallelism in
scientific applications is due to parallel work on elements
of the same large data structures. Parallel execution of 
substantially different threads of code only leads to a 
small degree of parallelism. This means that the 
approach sketched here for the extraction of parallelism 
is appropriate for systems with many processors. 
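The two ways of assigning loop iterations to processors can be compared with a small simulation. This is a sketch under assumptions: the iteration costs (a triangular loop), the chunk size and the function names are hypothetical and are not taken from the compilers cited in [2-4].

```python
import heapq


def static_makespan(costs, p):
    """Fixed portions: iteration i of n goes to processor i*p//n."""
    n = len(costs)
    load = [0.0] * p
    for i, c in enumerate(costs):
        load[i * p // n] += c
    return max(load)


def dynamic_makespan(costs, p, chunk):
    """Dynamic portions: the next idle processor takes the next chunk."""
    free_at = [0.0] * p              # min-heap of processor finish times
    heapq.heapify(free_at)
    for start in range(0, len(costs), chunk):
        t = heapq.heappop(free_at)   # processor that becomes idle first
        heapq.heappush(free_at, t + sum(costs[start:start + chunk]))
    return max(free_at)


# Hypothetical loop whose iteration i costs i+1 time units.
costs = [float(i + 1) for i in range(64)]
print(static_makespan(costs, 4), dynamic_makespan(costs, 4, 2))
```

With equal-sized fixed portions the processor holding the most expensive iterations determines the run time; handing out smaller portions on demand brings the finish times closer to the ideal value sum(costs)/p.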


The challenge of modern architectures is not the recognition
of parallelism in the applications but the support
for locality in the data references to make efficient use of 
memory hierarchies. This is in particular true for local 
memory architectures. 


Several of the state-of-the-art supercomputers contain, 
besides local registers for the processors, small and fast 
local or global memory in front of the huge shared main 
memory with larger latency and smaller bandwidth. 
Examples of this kind of architecture are the ETA10,
Cray-2 or the Alliant. Distributed memory architectures
behave in a similar way. The local memory of a processor
can be accessed very fast by the processor itself while the
access to the memory of other processors has larger
latency and smaller bandwidth. An essential difference
between the shared and distributed memory approach is
that in the first case the local memories of the processors
comprise only a small fraction of the total amount of
memory available.


Even for shared memory architectures, a compiler has to
optimize code for this kind of memory structure: it must
minimize references to the non-local memory
by exploiting the locality of memory accesses in the
algorithm.


Scientific applications contain in principle two different 
kinds of locality of memory references: matrix-type and 
grid-type. 


For multiplication of matrices of size n the number of
operations is 2n^3 while 3n^2 data transfers are necessary.
Furthermore, the matrix multiplication can be decomposed
into a series of smaller ones of any size, and the
favorable ratio of computations and memory transfers
depends on the size of the submatrices.
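The dependence of this ratio on the submatrix size can be checked with a small counting sketch. It assumes a simplified transfer model in which every product of bs x bs blocks moves three blocks (two operands and one result block); the function name and this model are illustrative assumptions, not part of the article.

```python
def blocked_matmul(a, b, bs):
    """Multiply two n x n matrices (lists of lists) block by block.

    Returns the product together with the number of floating-point
    operations and the number of modelled data transfers.
    Assumes that n is a multiple of the block size bs.
    """
    n = len(a)
    assert n % bs == 0
    c = [[0.0] * n for _ in range(n)]
    ops = 0
    transfers = 0
    for i0 in range(0, n, bs):
        for j0 in range(0, n, bs):
            for k0 in range(0, n, bs):
                transfers += 3 * bs * bs       # modelled block traffic
                for i in range(i0, i0 + bs):
                    for j in range(j0, j0 + bs):
                        s = c[i][j]
                        for k in range(k0, k0 + bs):
                            s += a[i][k] * b[k][j]
                            ops += 2           # one multiply, one add
                        c[i][j] = s
    return c, ops, transfers


n = 8
a = [[float(i + j) for j in range(n)] for i in range(n)]
b = [[float(i * j + 1) for j in range(n)] for i in range(n)]
for bs in (1, 2, 4, 8):
    _, ops, transfers = blocked_matmul(a, b, bs)
    print(bs, ops, transfers, ops / transfers)
```

Under this model the operation count is always 2n^3, while the ratio of operations to transfers is 2bs/3, so larger submatrices give a more favorable ratio.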


Figure 1 shows a simple but typical small program for a
grid-type computation. For an n x n grid, 5n^2 computations
but also at least O(3n^2) memory transfers have to
be executed. This number of memory references can only
be achieved if each value of the array UOLD has to be
loaded only once. This means that even UOLD(I-1,J)
and UOLD(I+1,J) do not have to be reloaded when
UNEW(I,J) is computed. Nevertheless, the ratio of
computations to memory transfers is small and depends only
little on the size of the problem.


      PROGRAM RELAX

      PARAMETER (N=100)

      REAL UOLD(0:N+1,0:N+1), UNEW(0:N+1,0:N+1),
     &     F(0:N+1,0:N+1)


      DO 10 I=1,N
      DO 10 J=1,N
         UNEW(I,J)=0.25*( F(I,J)
     &        + UOLD(I-1,J) + UOLD(I+1,J)
     &        + UOLD(I,J-1) + UOLD(I,J+1) )
   10 CONTINUE


      END
Figure 1. Example of a subroutine used for a simple 
grid-type problem. 


On the other hand, the computation of UNEW at a certain
element of the array requires only values from neighboring
points. This kind of locality can be exploited only
if the respective parts of the data structure can be kept 
local to the processor over several executions of the code 
segment and only the boundary values of a subgrid have 
to be updated. 


From both examples it can be concluded that a compiler
which is to optimize code for this kind of architecture
must be able to partition the data structures in such a
way that the parts match each other for the desired
computations. In some sense multidimensional strip
mining has to be applied to nested loops to exploit full
locality in the case of matrix-type problems and to yield
minimal memory transfers in the case of grid-type problems
if the code segment is executed only once.


The full locality in the case of grid-type problems can be
exploited only for distributed memory architectures,
because only in this type of architecture do the fast local
memories together form a significant part of the total
memory. In this case the parts of the data
structures have to be assigned statically to a specific
processor for a longer period of computation, and only
boundary information of the partitioned grid structures
will be passed to the memory of other processors.


As shown in the following section, the interactive
parallelizer SUPERB, the result of a research activity at the
University of Bonn, was designed according to these
requirements.


Structure of the Parallelizer 


SUPERB (SUprenum ParallelizER Bonn) is a semi-automatic
source-to-source parallelization system. In
contrast to existing parallelizers, SUPERB is designed to
combine both MIMD and SIMD parallelization into one
integrated interactive system that is oriented towards the
Suprenum computer and its application for large-scale
scientific computing.


As already mentioned in the first section, automatic
SIMD-parallelization is a well-known task, but it is
extremely difficult to detect parallelism for systems with
distributed memory automatically. In SUPERB, data
partitioning (the only useful way to extract enough
parallelism for this kind of machine) has to be done
interactively. The user assigns parts of the data domain,
e.g. a grid or a grid hierarchy represented by the arrays of
the program, to specific processes. Due to the inherent
incompleteness of analysis information the system
cannot automatically extract the global relationships
between the program's arrays necessary to obtain efficient
parallel code.


In principle, there are no restrictions on the kind of
programs SUPERB can be applied to. However, to be
successful in MIMD-parallelization, the programs
should work on a mesh or mesh-like data domain, the
computation at the mesh points should be local and the
problems to be solved should be large.


The overall structure of the system is depicted in Figure
2. The main components are the front-end, the core, the
transformation catalog and the back-end; they and the
overall parallelization process are outlined in the
following. (A detailed description of the structure can be
found in [7,8].)


Figure 2. Structure of SUPERB.


The front-end first transforms a Fortran 77 program into
an internal representation (attributed abstract syntax
tree, symbol table, call graph). The original program is
then split into the initial task (running on the front-end
machine and performing I/O) and the subtask (describing
the actual computation). In both tasks, control flow
(IF-conversion), DO-loops, expressions and special
Fortran features are normalized to facilitate application of
other transformations and analysis of the program. Some
initial analysis information is also computed, such as sets
of variables used or defined in a statement (in particular
if there are calls to other program units) and control flow
relationships describing possible execution sequences of
the program.
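IF-conversion, one of the normalizations named above, replaces a conditional branch by a guard that is treated as ordinary data, so that later analysis only has to deal with data dependences. The sketch below illustrates the idea on a hypothetical loop; it is not SUPERB's internal representation, and the function names are invented for the illustration.

```python
def with_branch(a, b):
    """Loop containing a conditional assignment (control dependence)."""
    out = list(b)
    for i in range(len(a)):
        if a[i] > 0.0:
            out[i] = 2.0 * a[i]
    return out


def if_converted(a, b):
    """The same loop after IF-conversion: the branch becomes a guard."""
    n = len(a)
    guard = [a[i] > 0.0 for i in range(n)]     # predicate stored as data
    # Every element is assigned; the guard selects the value.
    return [2.0 * a[i] if guard[i] else b[i] for i in range(n)]


a = [1.0, -2.0, 3.0, 0.0]
b = [9.0, 9.0, 9.0, 9.0]
print(with_branch(a, b) == if_converted(a, b))
```

The guarded form corresponds to a masked vector assignment, which is why this normalization helps later vectorization.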


The core, the main part of the system, controls the
execution of the other system parts, provides a catalog of
transformations and analysis services, and contains the
interface to the user.


The transformation catalog, organized as a hierarchically
structured set of menus, offers a number of transformations
(e.g. for MIMD-parallelization) the user can select
from. The analysis component verifies the
preconditions necessary for the application of a
transformation and supplies the user with details about his
program. For example, dependence information between
statements in loops, interprocedural relationships,
references which require communication between processes,
or conflicts caused by the currently selected data
partition can be computed and displayed.


The back-end produces the final Suprenum-Fortran
code. Vector code (corresponding to Fortran 8x syntax)
is generated for all vectorizable statements. The
information collected during the interactive parallelization
process is used to insert correct send/receive statements.
Some final optimizations are performed to increase the
efficiency of the generated code.


A Small Example 


The interactive parallelization process is described
below by using the small example program in Figure 1.


After applying the front-end to the program, the user
determines a partition specification. This specification
describes a set of partitioned arrays and the information
for mapping segments of these arrays to selected
processes.


Here the arrays UOLD, UNEW and F are partitioned
into segments as shown in Figure 3. Using special
analysis services offered by the core the user can look at
the communication overhead resulting from this
partition.


Figure 3. Partitions of UOLD, UNEW and F.


In the second phase the user identifies critical code
sections, i.e., sections causing much communication. He
can try to optimize the communication by applying
transformations like scalar forward substitution, induction
variable substitution or special MIMD transformations.
Besides these optimizations he may change the
partition specification.


In the example program, array elements which have to be
exchanged between processes are described by an
overlap area around the segments (see Figure 3). All
array elements read by process p and not local to p have
to be in this overlap area. Here, each process has an
overlap area of width one in every direction.
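The overlap scheme can be sketched in plain Python. The assumptions are stated loudly: a square n x n interior split into parts x parts equal segments, n a multiple of parts, and the exchange between processes simulated by copying from the global array. All function names are hypothetical; this is neither Suprenum-Fortran nor SUPERB output.

```python
def relax(uold, f, i_lo, i_hi, j_lo, j_hi, unew):
    """Kernel of Figure 1 on the index range [i_lo..i_hi] x [j_lo..j_hi]."""
    for i in range(i_lo, i_hi + 1):
        for j in range(j_lo, j_hi + 1):
            unew[i][j] = 0.25 * (f[i][j]
                                 + uold[i - 1][j] + uold[i + 1][j]
                                 + uold[i][j - 1] + uold[i][j + 1])


def fetch_segment(g, l1, r1, l2, r2):
    """Copy segment [l1..r1] x [l2..r2] plus an overlap of width one.

    Stands in for the local segment after the overlap exchange.
    """
    return [row[l2 - 1:r2 + 2] for row in g[l1 - 1:r1 + 2]]


def parallel_sweep(uold, f, n, parts):
    """One sweep in which every segment is updated from local data only."""
    unew = [[0.0] * (n + 2) for _ in range(n + 2)]
    step = n // parts
    for a in range(parts):
        for b in range(parts):
            l1, r1 = a * step + 1, (a + 1) * step
            l2, r2 = b * step + 1, (b + 1) * step
            u_loc = fetch_segment(uold, l1, r1, l2, r2)
            f_loc = fetch_segment(f, l1, r1, l2, r2)
            new_loc = [[0.0] * (step + 2) for _ in range(step + 2)]
            relax(u_loc, f_loc, 1, step, 1, step, new_loc)
            # Write the interior of the local result back (for checking).
            for i in range(1, step + 1):
                unew[l1 + i - 1][l2:r2 + 1] = new_loc[i][1:step + 1]
    return unew


n = 8
uold = [[float((i * j) % 7) for j in range(n + 2)] for i in range(n + 2)]
f = [[float(i + j) for j in range(n + 2)] for i in range(n + 2)]
sequential = [[0.0] * (n + 2) for _ in range(n + 2)]
relax(uold, f, 1, n, 1, n, sequential)
print(parallel_sweep(uold, f, n, 2) == sequential)
```

Because the five-point kernel reads only direct neighbors, an overlap of width one is sufficient for each segment to be updated without any further remote access during the sweep.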


Now the user is able to improve the vectorization of the
code interactively. The analysis component offers him
the possibility to examine dependence information in
loops. Thus cycles in the dependence graph preventing
vectorization can be detected. To remove such cycles,
the user may apply transformations such as scalar
expansion and loop distribution to selected loops or whole
units. In this phase no vector code is generated, but loops
are marked to be vectorizable.


In our example both loops can be vectorized.
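Scalar expansion and loop distribution, the two transformations named above, can be illustrated on a hypothetical loop that is unrelated to Figure 1. In the original loop the scalar t is rewritten in every iteration, which creates a dependence cycle; the function names are invented for this sketch.

```python
def original(a, b):
    """Loop with a temporary scalar that is reused in every iteration."""
    c = [0.0] * len(a)
    for i in range(len(a)):
        t = a[i] + b[i]      # t is overwritten in each iteration
        c[i] = t * t
    return c


def expanded_and_distributed(a, b):
    """The same computation after scalar expansion and loop distribution."""
    n = len(a)
    t = [0.0] * n            # scalar expansion: t becomes an array
    c = [0.0] * n
    for i in range(n):       # first loop: corresponds to T = A + B
        t[i] = a[i] + b[i]
    for i in range(n):       # second loop: corresponds to C = T * T
        c[i] = t[i] * t[i]
    return c


a = [1.0, 2.0, 3.0]
b = [0.5, 0.5, 0.5]
print(original(a, b) == expanded_and_distributed(a, b))
```

After the transformation each of the two loops consists of independent iterations and can be marked as vectorizable on its own.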


In the last phase the user applies special MIMD transformations
to further optimize the communication.
These transformations extract communication from 
loops and combine small messages into larger ones. 
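The benefit of combining messages can be estimated with a simple linear cost model. The start-up and per-word costs below are hypothetical values chosen for illustration, not Suprenum measurements, and the function names are invented for this sketch.

```python
def message_time(words, startup=100.0, per_word=1.0):
    """Hypothetical cost of one message: fixed start-up plus transfer."""
    return startup + per_word * words


def exchange_in_loop(n):
    # One message per boundary element, sent from inside the loop.
    return sum(message_time(1) for _ in range(n))


def exchange_hoisted(n):
    # Communication extracted from the loop: one message with n words.
    return message_time(n)


n = 100
print(exchange_in_loop(n), exchange_hoisted(n))
```

With a start-up cost much larger than the per-word cost, the single combined message is cheaper by nearly a factor of n, which is the effect these transformations aim at.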


Figure 4 shows the result of the transformation process
applied to the example. The communication between the
processes is organized by the EXCH statements
according to the overlap specification. All array elements
in the overlap area of the segments are updated. L1,
R1, L2, R2 are variables containing the bounds of the
segment assigned to the executing process.


      CALL EXCH (UOLD, 0, N+1, 0, N+1,
     &           [remaining arguments illegible])

      UNEW(L1:R1,L2:R2) = 0.25 * (F(L1:R1,L2:R2) +
     &     UOLD(L1-1:R1-1,L2:R2) +
     &     UOLD(L1+1:R1+1,L2:R2) +
     &     UOLD(L1:R1,L2-1:R2-1) +
     &     UOLD(L1:R1,L2+1:R2+1) )


Figure 4. Transformed code segment. 


A more detailed description of MIMD parallelization in 
SUPERB can be found in [9]. 


A prototype of the parallelizer is completed and can be 
demonstrated. Currently, some additional transforma- 
tions are being implemented and parts of the parallel- 
izing process improved so that they work without direct 
user assistance. The user will be able to define abbrevi- 
ations for frequently used sequences of transformations. 
The application of these transformations will be done 
automatically to selected parts of the program if the 
corresponding macro is invoked.